A  Scalable  Distributed  Syntactic,  Semantic, 
and  Lexical  Language  Model 


Ming  Tan* 

Wright  State  University 

Wenli  Zhou** 

Wright  State  University 

Lei  Zheng' 

Wright  State  University 

Shaojun  Wang 

Wright  State  University 


This  paper  presents  an  attempt  at  building  a  large  scale  distributed  composite  language  model 
that  is  formed  by  seamlessly  integrating  an  n-gram  model,  a  structured  language  model,  and 
probabilistic latent semantic analysis under a directed Markov random field paradigm to simultaneously
account for local word lexical information, mid-range sentence syntactic structure,
and long-span document semantic content. The composite language model has been trained by
performing a convergent N-best list approximate EM algorithm and a follow-up EM algorithm
to improve word prediction power on corpora with up to a billion tokens and stored on a
supercomputer.  The  large  scale  distributed  composite  language  model  gives  drastic  perplexity 
reduction  over  n-grams  and  achieves  significantly  better  translation  quality  measured  by  the 
Bleu  score  and  "readability"  of  translations  when  applied  to  the  task  of  re-ranking  the  N-best  list 
from  a  state-of-the-art  parsing-based  machine  translation  system. 

1.  Introduction 

Markov chain (n-gram) source models, which predict each word on the basis of the
previous n - 1 words, have been the workhorses of state-of-the-art speech recognizers
and machine translators; they help to resolve acoustic or foreign language ambiguities by
placing higher probability on more likely original underlying word strings. Although
Markov chains are efficient at encoding local word interactions, the n-gram model
clearly ignores the rich syntactic and semantic structures that constrain natural languages.
Attempting to increase the order of an n-gram to capture longer range dependencies
in natural language immediately runs into the curse of dimensionality (Bengio
et al. 2003). The performance of conventional n-gram technology has essentially reached
a plateau (Rosenfeld 2000b; Zhang 2008), and it has proven remarkably difficult to
improve on n-grams (Jelinek 1991; Jelinek and Chelba 1999). Research groups (Och 2005;
Zhang, Hildebrand, and Vogel 2006; Brants et al. 2007; Emami, Papineni, and Sorensen
2007) have shown that, using an immense distributed computing paradigm, up to
6-grams can be trained on up to billions and trillions of tokens, yielding consistent system
improvements because of excellent n-gram hit ratios on unseen test data, but Zhang
(2008) did not observe much improvement beyond 6-grams. As the machine translation
(MT) working groups stated in their final report (Lavie et al. 2006, page 3), "These
approaches have resulted in small improvements in MT quality, but have not fundamentally
solved the problem. There is a dire need for developing novel approaches to
language modeling."

Over  the  past  two  decades,  more  sophisticated  models  have  been  developed  that 
outperform n-grams; these are mainly the syntactic language models (Della Pietra et al.
1994; Chelba 2000; Chelba and Jelinek 2000; Charniak 2001; Roark 2001; Wang and
Harper 2002; Jelinek 2004; Benedi and Sanchez 2005; Van Uytsel and Compernolle 2005)
that  effectively  exploit  sentence-level  syntactic  structure  of  natural  language,  and  the 
topic  language  models  (Saul  and  Pereira  1997;  Gildea  and  Hofmann  1999;  Bellegarda 
2000;  Wallach  2006)  that  exploit  document-level  semantic  content.  Unfortunately,  each 
of  these  language  models  only  targets  some  specific,  distinct  linguistic  phenomena 
(Pereira  2000;  Rosenfeld  2000a,  2000b);  thus,  each  captures  and  exploits  different  aspects 
of  natural  language  regularity.  A  natural  question  we  should  ask  is  whether/how 
we  can  construct  more  complex  and  powerful  but  computationally  tractable  language 
models by integrating many existing/emerging language model components, with each
component  focusing  on  specific  linguistic  phenomena  like  syntactic  structure,  semantic 
topic,  morphology,  and  pragmatics  in  complementary,  supplementary,  and  coherent 
ways  (Bellegarda  2001,  2003). 

Several  techniques  for  combining  language  models  have  been  investigated.  The 
most  commonly  used  method  is  linear  interpolation  (Chen  and  Goodman  1999;  Jelinek 
and  Mercer  1980;  Goodman  2001),  where  each  individual  model  is  trained  separately 
and  then  combined  by  a  weighted  linear  combination.  All  of  the  syntactic  structure- 
based  models  have  used  linear  interpolation  to  combine  trigrams  to  achieve  further 
improvement  over  using  their  own  models  alone  (Charniak  2001;  Chelba  and  Jelinek 
2000;  Chelba  2000;  Roark  2001).  The  weights  in  this  case  are  trained  using  held-out 
data.  Even  though  this  technique  is  simple  and  easy  to  implement,  it  does  not  generally 
yield  very  effective  combinations  (Rosenfeld  1996)  because  the  linear  additive  form 
is  a  strong  assumption  in  capturing  subtleties  in  each  of  the  component  models  (see 
more  explanation  and  analysis  in  Section  6.2  and  Appendix  A).  The  second  method 
is  based  on  maximum  entropy  philosophy,  which  became  very  popular  in  machine 
learning  and  natural  language  processing  communities  due  to  the  work  in  Berger, 
Della  Pietra,  and  Della  Pietra  (1996),  Della  Pietra,  Della  Pietra,  and  Lafferty  (1997), 
Lau  et  al.  (1993)  and  Rosenfeld  (1996).  In  fact,  for  a  complete  data  case,  maximum 
entropy  is  nothing  but  maximum  likelihood  estimation  for  undirected  Markov  random 
fields  (MRFs)  (Berger,  Della  Pietra,  and  Della  Pietra  1996;  Della  Pietra,  Della  Pietra, 
and  Lafferty  1997).  As  stated  in  Wang  et  al.  (2005b),  however,  there  are  two  weaknesses 
with the maximum entropy approach. The first weakness is that this approach can only
model distributions over explicitly observed features, but we know there is hidden
information in natural language, such as syntactic structure and semantic topic. The
second weakness is that if the statistical model is too complex it becomes intractable to
estimate model parameters; computationally very expensive Markov chain Monte Carlo
sampling methods (Mark, Miller, and Grenander 1996; Rosenfeld 2000b; Rosenfeld,
Chen, and Zhu 2001) would have to be used. One way to overcome the first hurdle
is to use a preprocessing tool to extract hidden features. For example, Rosenfeld (1996) used
a mutual information clustering method to find word pair triggers, then combined these
triggers with trigrams through a maximum conditional entropy approach to allow the
discourse topic to influence word prediction; Khudanpur and Wu (2000) used Chelba
and Jelinek's structured language model and a word clustering model to extract relevant
grammatical and semantic features, then again combined these features with trigrams
through a maximum conditional entropy approach to form a syntactic, semantic, and
lexical language model. Wang and colleagues (Wang et al. 2005a; Wang, Schuurmans,
and Zhao 2012) have proposed the latent maximum entropy (LME) principle, which
extends standard maximum entropy estimation by incorporating hidden dependency
structure, but the LME still does not overcome the second hurdle. The third method is
the directed Markov random field (Wang et al. 2005b), which overcomes both weaknesses in
the maximum entropy approach. Wang et al. used this approach to combine trigram,
probabilistic  context-free  grammar  (PCFG),  and  probabilistic  latent  semantic  analysis 
(PLSA) models; a generalized inside-outside algorithm is derived that alters the well-known
inside-outside algorithm for PCFG (Baker 1979; Lari and Young 1990) with
modular modification to take into account the effect of n-gram and PLSA while remaining
at the same cubic time complexity. When applying this to the Wall Street Journal
corpus with 40 million tokens, they achieved moderate perplexity reduction. Because
the probabilistic dependency structure in a structured language model (SLM) (Chelba
2000; Chelba and Jelinek 2000) is more complex and powerful than that in a PCFG,
Wang et al. (2006) studied the stochastic properties for the composite language model
that integrates n-gram, SLM, and PLSA under the directed MRF framework (Wang et al.
2005b) and derived another generalized inside-outside algorithm to train a composite
n-gram, SLM, and PLSA language model from a general expectation maximization (EM)
(Dempster, Laird, and Rubin 1977) algorithm by following Jelinek's ingenious definition
of the inside and outside probabilities for SLM (Jelinek 2004). Again, the generalized
inside-outside algorithm alters Jelinek's inside-outside algorithm with modular modification
and has the same sixth order of sentence-length time complexity. Unfortunately,
no experimental results were reported.

In this article, we study the same composite n-gram, SLM, and PLSA model under
the directed MRF framework as in Wang et al. (2006). The composite n-gram/SLM/PLSA
language model under the directed MRF paradigm is first introduced in
Section  2.  In  Section  3,  instead  of  using  the  sixth  order  generalized  inside-outside 
algorithm  proposed  in  Wang  et  al.  (2006),  we  show  how  to  train  this  composite  model 
via  an  N-best  list  approximate  EM  algorithm  that  has  linear  time  complexity  and  a 
follow-up  EM  algorithm  to  improve  word  prediction  power.  We  prove  the  convergence 
of  the  N-best  list  approximate  EM  algorithm.  To  resolve  the  data  sparseness  problem, 
we  generalize  Jelinek  and  Mercer's  recursive  mixing  scheme  for  Markov  source  (Jelinek 
and  Mercer  1980)  to  a  mixture  of  Markov  chains.  To  handle  large-scale  corpora  up  to  a 
billion  tokens,  we  demonstrate  how  to  implement  these  algorithms  under  a  distributed 
computing  environment  and  how  to  store  this  language  model  on  a  supercomputer.  In 
Section 4, we describe how to use the model for testing. Related works are then summarized
and compared in Section 5. Because language modeling is a data-rich and feature-rich
density estimation problem, there is always a trade-off between approximation error
and estimation error; thus, in Section 6 we conduct comprehensive experiments on
corpora with 44 million tokens, 230 million tokens, and 1.3 billion tokens, and compare
perplexity results with n-grams (n = 3, 4, 5 respectively) on these three corpora under
various situations; drastic perplexity reductions are obtained. We explain why the composite
language models lead to better predictive capacity than linear interpolation. The
proposed  composite  language  models  are  applied  to  the  task  of  re-ranking  the  N-best 
list  from  Hiero  (Chiang  2005,  2007),  a  state-of-the-art  parsing-based  machine  translation 
system;  we  achieve  significantly  better  translation  quality  measured  by  the  Bleu  score 
and  "readability"  of  translations.  Finally,  we  draw  our  conclusions  and  propose  future 
work  in  Section  7. 

The  main  theme  of  our  approach  is  "to  exploit  information,  be  it  syntactic  structure 
or  semantic  fabric,  which  involves  a  fairly  high  degree  of  cognition.  This  is  precisely 
the  kind  of  knowledge  that  humans  naturally  and  inherently  use  to  process  natural 
language,  so  it  can  be  reasonably  conjectured  to  represent  a  key  ingredient  for  success" 
(Bellegarda  2003,  p.  105).  In  that  light,  the  directed  MRF  framework,  "whose  ultimate 
goal  is  to  integrate  all  available  knowledge  sources,  appears  most  likely  to  harbor  a 
potential  breakthrough.  It  is  hoped  that  the  on-going  effort  conducted  in  this  work  to 
leverage such latent synergies will lead, in the not-too-distant future, to more polyvalent,
multi-faceted, effective and tractable solutions for language modeling - this is only
beginning  to  scratch  the  surface  in  developing  systems  capable  of  deep  understanding 
of  natural  language"  (Bellegarda  2003,  p.  105). 


2.  The  Composite  n-gram/SLM/PLSA  Language  Model 

Let $X$ denote a set of random variables $(X_\tau)_{\tau \in \Gamma}$ taking values in a (discrete) probability
space $(\mathcal{X}_\tau)_{\tau \in \Gamma}$, where $\Gamma$ is a finite set of states. We define a (discrete) directed Markov
random field to be a probability distribution $\mathcal{P}$, which admits a recursive factorization
if there exist non-negative functions $k^\tau(\cdot, \cdot)$, $\tau \in \Gamma$, defined on $\mathcal{X}_\tau \times \mathcal{X}_{pa(\tau)}$ such that
$\sum_{x_\tau} k^\tau(x_\tau, x_{pa(\tau)}) = 1$ and $\mathcal{P}$ has density

$$p(x) = \prod_{\tau \in \Gamma} k^\tau(x_\tau, x_{pa(\tau)}) \tag{1}$$

Here $pa(\tau)$ denotes the set of parent states of $\tau$. If the recursive factorization refers to a
graph,  then  we  have  a  Bayesian  network  (Lauritzen  1996).  Broadly  speaking,  however, 
the  recursive  factorization  can  refer  to  a  representation  more  complicated  than  a  graph 
with a fixed set of nodes and edges; for example, PCFG and SLM are
directed MRFs whose parse tree structure is a random object that cannot be described
as a Bayesian network (McAllester, Collins, and Pereira 2004). A key difference between
directed MRFs and undirected MRFs is that a directed MRF requires many
local  normalization  constraints  whereas  an  undirected  MRF  has  a  global  normalization 
factor. 

The  n-gram  (Jelinek  1998;  Jurafsky  and  Martin  2008)  language  model  is  essentially  a 
WORD-PREDICTOR,  that  is,  given  its  entire  document  history,  it  predicts  the  next  word 
$w_{k+1} \in \mathcal{V}$ based on the last $n-1$ words with probability $p(w_{k+1} \mid w_{k-n+2}^{k})$, where
$w_{k-n+2}^{k} = w_{k-n+2}, \cdots, w_k$ and $\mathcal{V}$ denotes the vocabulary.

The SLM proposed in Chelba and Jelinek (1998, 2000) and Chelba (2000) uses syntactic
information beyond the regular n-gram models to capture sentence-level long-range
dependencies. The SLM is based on statistical parsing techniques that allow syntactic
analysis of sentences; it assigns a probability $p(W, T)$ to every sentence $W$ and every
possible binary parse $T$. The terminals of $T$ are the words of $W$ with part of speech
(POS) tags, and the nodes of $T$ are annotated with phrase headwords and non-terminal
labels. Let $W$ be a sentence of length $n$ words to which we have prepended the sentence
beginning marker <s> and appended the sentence end marker </s> so that $w_0 =$ <s> and
$w_{n+1} =$ </s>. Let $W_k = w_0, \cdots, w_k$ be the word $k$-prefix of the sentence (the words from
the beginning of the sentence up to the current position $k$) and $W_k T_k$ be the word-parse
$k$-prefix. A word-parse $k$-prefix has a set of exposed heads $h_{-m}, \cdots, h_{-1} \in \mathcal{H}$, with each
head being a pair (headword, non-terminal label), $\mathcal{H} = \mathcal{V} \times \mathcal{NT}$, where $\mathcal{NT}$ denotes
the set of non-terminal labels (NTlabel), or in the case of a root-only tree (word, POS tag),
$\mathcal{H} = \mathcal{V} \times \mathcal{O}$, where $\mathcal{O}$ denotes the set of POS tags. The exposed heads at a given position
$k$ in the input sentence are a function of the word-parse $k$-prefix.

The  SLM  operates  left-to-right,  building  up  the  parse  structure  in  a  bottom-up 
manner.  At  any  given  stage  of  the  word  generation  by  the  SLM,  the  exposed  headwords 
are  those  headwords  of  the  current  partial  parse  which  are  not  yet  part  of  a  higher 
phrase with a head of its own. An mth order SLM (m-SLM) has three operators to
generate  a  sentence: 

• The WORD-PREDICTOR predicts the next word $w_{k+1} \in \mathcal{V}$ based on the m
most recently exposed headwords $h_{-m}^{-1} = h_{-m}, \cdots, h_{-1}$ in the word-parse
$k$-prefix with probability $p(w_{k+1} \mid h_{-m}^{-1})$, and then passes control to the
TAGGER.

• The TAGGER predicts the POS tag $t_{k+1} \in \mathcal{O}$ to the next word $w_{k+1}$ based
on the next word $w_{k+1}$ and the POS tags of the m most recently exposed
headwords $h_{-m}^{-1}$ (denoted as $h_{-m}^{-1}.\mathrm{tag} = h_{-m}.\mathrm{tag}, \cdots, h_{-1}.\mathrm{tag}$) in the
word-parse $k$-prefix with probability $p(t_{k+1} \mid w_{k+1}, h_{-m}^{-1}.\mathrm{tag})$.

• The CONSTRUCTOR builds the partial parse $T_{k+1}$ from $T_k$, $w_{k+1}$, and $t_{k+1}$
in a series of moves ending with NULL, where a parse move $a$ is made
with probability $p(a \mid h_{-m}^{-1})$; $a \in \mathcal{A} = \{$(unary, NTlabel), (adjoin-left, NTlabel),
(adjoin-right, NTlabel), NULL$\}$. Depending on an action $a$ = adjoin-right
or adjoin-left, the headword $h_{-1}$ or $h_{-2}$ is percolated up by one tree level,
the indices of the current exposed headwords $h_{-3}, h_{-4}, \cdots$ are increased
by 1, and these headwords together with $h_{-1}$ or $h_{-2}$ become the new
exposed headwords. Once the CONSTRUCTOR hits NULL, the
headword indexing and current parse structure remain as they are,
and the CONSTRUCTOR passes control to the WORD-PREDICTOR.
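The effect of the CONSTRUCTOR moves on the exposed heads can be illustrated with a small sketch. This is not the authors' implementation: the function names `shift` and `apply_move` are hypothetical, the list-based stack is only one possible representation, the treatment of the unary move (relabeling $h_{-1}$) is an assumption, and the POS tags for "as", "its", and "host" as well as the labels passed to each move are chosen for illustration.

```python
from typing import List, Tuple

Head = Tuple[str, str]  # (headword, label), where label is a POS tag or an NTlabel


def shift(heads: List[Head], word: str, tag: str) -> List[Head]:
    """WORD-PREDICTOR + TAGGER: the new (word, POS tag) pair becomes h_{-1}."""
    return heads + [(word, tag)]


def apply_move(heads: List[Head], move: str, label: str = "") -> List[Head]:
    """Return the exposed heads after one CONSTRUCTOR move.

    heads[-1] plays the role of h_{-1} and heads[-2] the role of h_{-2}.
    """
    if move == "null":
        return list(heads)  # nothing changes; control returns to the WORD-PREDICTOR
    if move == "unary":
        return heads[:-1] + [(heads[-1][0], label)]  # assumed: relabel h_{-1}
    if move == "adjoin-right":
        return heads[:-2] + [(heads[-1][0], label)]  # headword of h_{-1} percolates up
    if move == "adjoin-left":
        return heads[:-2] + [(heads[-2][0], label)]  # headword of h_{-2} percolates up
    raise ValueError(f"unknown move: {move}")


# Hypothetical trace starting from three exposed heads.
heads = [("<s>", "SB"), ("show", "np"), ("has", "vbz")]
for word, tag in [("as", "in"), ("its", "prp$"), ("host", "nn")]:  # assumed POS tags
    heads = shift(heads, word, tag)
for move, label in [("adjoin-right", "np"), ("adjoin-left", "pp"),
                    ("adjoin-left", "vp"), ("null", "")]:
    heads = apply_move(heads, move, label)
print(heads)
```

Each adjoin move replaces the two most recent heads with a single head, so the remaining heads shift one position toward $h_{-1}$, which is the index change described in the CONSTRUCTOR bullet.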

SLM  is  thus  essentially  a  generalization  of  a  shift-reduce  parser  (Aho  and  Ullman 
1972)  with  adjoin  corresponding  to  reduce  and  predict  to  shift.  (See  a  detailed  description 
about  SLM  in  Chelba  and  Jelinek  [1998,  2000];  Chelba  [2000];  Jelinek  [2004]).  As  an 
example taken from Jelinek (2004), Figure 1 shows a complete parse where SB/SE is a
distinguished POS tag for <s>/</s> respectively, (<s>, TOP) is the only allowed head, and
(</s>, TOP') is the head of any constituent that dominates </s> but not <s>. In Figure 1,
at the time just after the word as is generated, the exposed headwords are "<s> SB,
show np, has vbz." The subsequent model actions are: "POStag as, null, predict its,
POStag its, null, predict host, POStag host, adjoin-right-np, adjoin-left-pp, adjoin-left-pp,
null, predict a, ..."


Figure  1 

A  complete  parse  tree  by  the  structured  language  model. 


A  PLSA  model  (Hofmann  2001)  is  a  generative  probabilistic  model  of  word- 
document  co-occurrences  using  the  bag-of-words  assumption  described  as  follows: 


• Choose a document $d$ with probability $p(d)$.

• SEMANTIZER selects a semantic class $g \in \mathcal{G}$ with probability $p(g \mid d)$, where
$\mathcal{G}$ denotes the set of topics.

• WORD-PREDICTOR picks a word $w \in \mathcal{V}$ with probability $p(w \mid g)$.


Because only one pair of $(d, w)$ is being observed, the joint probability model is a mixture
of log-linear models with the expression $p(d, w) = p(d) \sum_{g} p(w \mid g)\, p(g \mid d)$. Typically, the
number  of  documents  and  the  vocabulary  size  are  much  larger  than  the  size  of  latent 
semantic  class  variables.  Latent  semantic  class  variables  therefore  function  as  bottleneck 
variables  to  constrain  word  occurrences  in  documents. 
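As a minimal numerical sketch of the PLSA mixture $p(d, w) = p(d) \sum_g p(w \mid g)\, p(g \mid d)$, the following computes the joint probability from three lookup tables. The documents, topics, words, and probability values are invented for illustration, and the function name `plsa_joint` is hypothetical.

```python
def plsa_joint(p_d, p_g_given_d, p_w_given_g, d, w):
    """Return p(d, w) = p(d) * sum over topics g of p(w | g) * p(g | d)."""
    return p_d[d] * sum(p_w_given_g[g][w] * p_g
                        for g, p_g in p_g_given_d[d].items())


# Toy parameters (illustrative values only).
p_d = {"doc1": 0.5, "doc2": 0.5}
p_g_given_d = {
    "doc1": {"finance": 0.75, "sports": 0.25},
    "doc2": {"finance": 0.25, "sports": 0.75},
}
p_w_given_g = {
    "finance": {"bank": 0.5, "game": 0.125, "host": 0.375},
    "sports": {"bank": 0.125, "game": 0.5, "host": 0.375},
}

joint_bank = plsa_joint(p_d, p_g_given_d, p_w_given_g, "doc1", "bank")
total = sum(plsa_joint(p_d, p_g_given_d, p_w_given_g, d, w)
            for d in p_d for w in ("bank", "game", "host"))
print(joint_bank, total)
```

With only two topics standing between documents and words, every word distribution for a document is forced to be a blend of the two topic distributions, which is the bottleneck role described above.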

When combining n-gram, m-SLM, and PLSA together to build a composite
generative language model under the directed MRF paradigm (Wang et al. 2005b,
2006), the composite language model is simply a complicated generative model that has
four operators: WORD-PREDICTOR, TAGGER, CONSTRUCTOR, and SEMANTIZER.
The TAGGER and CONSTRUCTOR in SLM and the SEMANTIZER in PLSA remain
unchanged; the WORD-PREDICTORs in n-gram, m-SLM, and PLSA, however, are
combined to form a stronger WORD-PREDICTOR that generates the next word, $w_{k+1}$,
not only depending on the m most recently exposed headwords $h_{-m}^{-1}$ in the word-parse
$k$-prefix but also its n-gram history $w_{k-n+2}^{k}$ and its semantic content $g_{k+1}$. The parameter
for WORD-PREDICTOR in the composite n-gram/m-SLM/PLSA language model
becomes $p(w \mid w_{-n+1}^{-1} h_{-m}^{-1} g)$. The resulting composite language model has an even more
complex dependency structure but with more expressive power than the original
SLM. Figure 2 illustrates the structure of a composite n-gram/m-SLM/PLSA language
model.

Figure 2

A composite n-gram/m-SLM/PLSA language model where the hidden information is the parse
tree $T$ and semantic content $g$. The n-gram encodes local word interactions, the m-SLM models
the sentence's syntactic structure, and the PLSA captures the document's semantic content;
all interact together to constrain the generation of natural language. The WORD-PREDICTOR
generates the next word $w_{k+1}$ with probability $p(w_{k+1} \mid w_{k-n+2}^{k} h_{-m}^{-1} g_{k+1})$ instead of
$p(w_{k+1} \mid w_{k-n+2}^{k})$, $p(w_{k+1} \mid h_{-m}^{-1})$, and $p(w_{k+1} \mid g_{k+1})$, respectively.


The composite n-gram/m-SLM/PLSA language model can be formulated as a
rather complex chain-tree-table directed MRF model (Wang et al. 2006) with local
normalization constraints for the parameters of each model component, WORD-PREDICTOR,
TAGGER, CONSTRUCTOR, and SEMANTIZER. That is,

$$\sum_{w \in \mathcal{V}} p(w \mid w_{-n+1}^{-1} h_{-m}^{-1} g) = 1 \tag{2}$$

$$\sum_{t \in \mathcal{O}} p(t \mid w h_{-m}^{-1}.\mathrm{tag}) = 1 \tag{3}$$

$$\sum_{a \in \mathcal{A}} p(a \mid h_{-m}^{-1}) = 1 \tag{4}$$

$$\sum_{g \in \mathcal{G}} p(g \mid d) = 1 \tag{5}$$

If we look at the example in Figure 1, for the composite n-gram/m-SLM/PLSA
language model there exists a SEMANTIZER's action to choose a topic $g$ before
any WORD-PREDICTOR's action. Moreover, for m-SLM, its WORD-PREDICTOR
predicts the next word, such as a, based on the m most recently exposed headwords
"<s>-SB, show-np, has-vp," but for the composite model, the WORD-PREDICTOR
predicts the next word a based on the m most recently exposed headwords "<s>-SB,
show-np, has-vp," n-grams "as its host," and a topic $g$. These are the only differences
between SLM and our proposed composite language model.

3.  Training  Algorithm 

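For the composite n-gram/m-SLM/PLSA language model under the directed MRF
paradigm, the likelihood of a training corpus $\mathcal{D}$, a collection of documents, can be
written as

$$\mathcal{L}(\mathcal{D}, p) = \prod_{d \in \mathcal{D}} \left( \left( \prod_{l} \left( \sum_{G^l} \left( \sum_{T^l} P_p(W^l, T^l, G^l \mid d) \right) \right) \right) p(d) \right) \tag{6}$$
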
where $(W^l, T^l, G^l \mid d)$ denotes the joint sequence of the $l$th sentence $W^l$ with its parse structure
$T^l$ and semantic annotation string $G^l$ in document $d$. This sequence is produced by
a unique sequence of model actions: WORD-PREDICTOR, TAGGER, CONSTRUCTOR,
SEMANTIZER moves; its probability is obtained by chaining the probabilities of these
moves

$$
\begin{aligned}
P_p(W^l, T^l, G^l \mid d) = {} & \prod_{g \in \mathcal{G}} \Bigg( p(g \mid d)^{\#(g, W^l, G^l, d)} \prod_{h_{-1}, \cdots, h_{-m} \in \mathcal{H}} \; \prod_{w, w_{-1}, \cdots, w_{-n+1} \in \mathcal{V}} p(w \mid w_{-n+1}^{-1} h_{-m}^{-1} g)^{\#(w_{-n+1}^{-1} w h_{-m}^{-1} g,\, W^l, T^l, G^l, d)} \Bigg) \\
& \prod_{t \in \mathcal{O}} p(t \mid w h_{-m}^{-1}.\mathrm{tag})^{\#(t w h_{-m}^{-1}.\mathrm{tag},\, W^l, T^l, d)} \prod_{a \in \mathcal{A}} p(a \mid h_{-m}^{-1})^{\#(a h_{-m}^{-1},\, W^l, T^l, d)}
\end{aligned}
\tag{7}
$$

where $\#(g, W^l, G^l, d)$ is the count of semantic content $g$ in semantic annotation string $G^l$ of
the $l$th sentence $W^l$ in document $d$; $\#(w_{-n+1}^{-1} w h_{-m}^{-1} g, W^l, T^l, G^l, d)$ is the count of n-grams,
its m most recently exposed headwords, and semantic content $g$ in parse $T^l$ and semantic
annotation string $G^l$ of the $l$th sentence $W^l$ in document $d$; $\#(t w h_{-m}^{-1}.\mathrm{tag}, W^l, T^l, d)$ is the
count of tag $t$ predicted by word $w$ and the tags of m most recently exposed headwords
in parse tree $T^l$ of the $l$th sentence $W^l$ in document $d$; and finally $\#(a h_{-m}^{-1}, W^l, T^l, d)$ is the
count of constructor move $a$ conditioning on m exposed headwords $h_{-m}^{-1}$ in parse tree $T^l$
of the $l$th sentence $W^l$ in document $d$.
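Equation (7) is a product of parameters raised to the number of times each move occurs, so in log space it is a count-weighted sum. The sketch below shows that bookkeeping only. The move encoding as (operator, outcome, context) tuples, the context strings, and the parameter values are all hypothetical; it is not the authors' data structure.

```python
import math
from collections import Counter


def derivation_log_prob(moves, params):
    """Log-probability of one derivation: sum of count * log(parameter) per distinct move."""
    counts = Counter(moves)
    return sum(c * math.log(params[move]) for move, c in counts.items())


# Hypothetical parameters keyed by (operator, outcome, context).
params = {
    ("SEMANTIZER", "g1", "d"): 0.5,
    ("WORD-PREDICTOR", "host", "as its | has | g1"): 0.25,
    ("TAGGER", "nn", "host | vbz"): 0.5,
    ("CONSTRUCTOR", "null", "has"): 0.5,
}

# A made-up sequence of moves; the CONSTRUCTOR move occurs twice, so its count is 2.
moves = [
    ("SEMANTIZER", "g1", "d"),
    ("WORD-PREDICTOR", "host", "as its | has | g1"),
    ("TAGGER", "nn", "host | vbz"),
    ("CONSTRUCTOR", "null", "has"),
    ("CONSTRUCTOR", "null", "has"),
]

log_p = derivation_log_prob(moves, params)
print(math.exp(log_p))
```

Working with counts rather than the raw move sequence is what later allows expected counts to be accumulated over many derivations and normalized in closed form.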

Let

$$\tilde{\mathcal{L}}(\mathcal{D}, p) = \prod_{d \in \mathcal{D}} \left( \prod_{l} \left( \sum_{G^l} \left( \sum_{T^l} P_p(W^l, T^l, G^l \mid d) \right) \right) \right) \tag{8}$$

then

$$\mathcal{L}(\mathcal{D}, p) = \tilde{\mathcal{L}}(\mathcal{D}, p) \prod_{d \in \mathcal{D}} p(d) \tag{9}$$

Clearly, when maximizing $\mathcal{L}(\mathcal{D}, p)$ in Equation (6), $p(d)$ is an ancillary term that is
independent of all other data-generating parameters; it is not critical to anything that
follows. Moreover, when a language model is used to find the most likely word sequence
in machine translation and speech recognition, this term is useless. Thus, similar
to an n-gram language model, we will generally ignore this term and concentrate on
optimizing Equation (8) in the subsequent development.

The objective of maximum likelihood estimation is to maximize the likelihood
$\mathcal{L}(\mathcal{D}, p)$ with respect to model parameters. For a given sentence, its parse tree and
semantic content are hidden and the number of parse trees grows faster than exponentially
with sentence length; Wang et al. (2006) have derived a generalized inside-outside
algorithm by applying the standard EM algorithm and considering the auxiliary
function

$$Q(p', p) = \sum_{d \in \mathcal{D}} \sum_{l} \sum_{G^l} \sum_{T^l} P_p(T^l, G^l \mid W^l, d) \log P_{p'}(W^l, T^l, G^l \mid d) \tag{10}$$

The complexity of this algorithm is sixth order in sentence length, however; thus it is
computationally too expensive to be practical for a large corpus even with the use of
pruning on charts (Jelinek and Chelba 1999; Jelinek 2004).


3.1  N-best  List  Approximate  EM 

Similar  to  SLM  (Chelba  and  Jelinek  1998,  2000;  Chelba  2000),  we  adopt  an  N-best  list 
approximate  EM  re-estimation  with  modular  modifications  to  seamlessly  incorporate 
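the effect of n-gram and PLSA components. Instead of maximizing the likelihood
$\mathcal{L}(\mathcal{D}, p)$, we maximize the N-best list likelihood,

$$\max_{\mathcal{T}'_N} \mathcal{L}(\mathcal{D}, p, \mathcal{T}'_N) = \prod_{d \in \mathcal{D}} \left( \prod_{l} \left( \max_{\mathcal{T}'^{l}_N \in \mathcal{T}'_N} \left( \sum_{G^l} \left( \sum_{T^l \in \mathcal{T}'^{l}_N,\ \|\mathcal{T}'^{l}_N\| = N} P_p(W^l, T^l, G^l \mid d) \right) \right) \right) \right) \tag{11}$$

where $\mathcal{T}'^{l}_N$ is a set of $N$ parse trees for sentence $W^l$ in document $d$, $\|\cdot\|$ denotes the
cardinality, and $\mathcal{T}'_N$ is a collection of $\mathcal{T}'^{l}_N$ for sentences over the entire corpus $\mathcal{D}$.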

The N-best list approximate EM involves two steps:

1. N-best list search: For each sentence $W^l$ in document $d$, find the N-best
parse trees,

$$\mathcal{T}^{l}_N = \arg\max_{\mathcal{T}'^{l}_N} \left\{ \sum_{G^l} \sum_{T^l \in \mathcal{T}'^{l}_N} P_p(W^l, T^l, G^l \mid d),\ \|\mathcal{T}'^{l}_N\| = N \right\}$$

and denote $\mathcal{T}_N$ as the collection of N-best list parse trees for sentences
over the entire corpus $\mathcal{D}$ under model parameter $p$.

2. EM update: Perform one iteration (or several iterations) of the EM
algorithm to estimate model parameters that maximize the N-best list
likelihood of the training corpus $\mathcal{D}$,

$$\mathcal{L}(\mathcal{D}, p, \mathcal{T}_N) = \prod_{d \in \mathcal{D}} \left( \prod_{l} \left( \sum_{G^l} \left( \sum_{T^l \in \mathcal{T}^{l}_N \in \mathcal{T}_N} P_p(W^l, T^l, G^l \mid d) \right) \right) \right)$$

That is,

(a) E-step: Compute the auxiliary function of the N-best list likelihood

$$Q(p', p, \mathcal{T}_N) = \sum_{d \in \mathcal{D}} \sum_{l} \sum_{G^l} \sum_{T^l \in \mathcal{T}^{l}_N \in \mathcal{T}_N} P_p(T^l, G^l \mid W^l, d) \log P_{p'}(W^l, T^l, G^l \mid d)$$

(b) M-step: Maximize $Q(p', p, \mathcal{T}_N)$ with respect to $p'$ to get the new
update for $p$.

Iterate steps (1) and (2) until the convergence of the N-best list likelihood.
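The alternation between N-best search and a restricted EM update can be sketched on a toy model. The example below is an assumption-laden stand-in, not the composite language model: the hidden "parse" is replaced by a single hidden state $z$ per observation $x$, the joint probability is $p(z)\,p(x \mid z)$, and all names (`n_best`, `em_update`, `n_best_em`) and numbers are hypothetical. It only illustrates the control flow and the closed-form M-step on the N-best list likelihood.

```python
import math


def n_best(x, prior, emit, n):
    """Step 1 (search): the n hidden states with the largest joint p(z) * p(x | z)."""
    ranked = sorted(range(len(prior)), key=lambda z: -prior[z] * emit[z][x])
    return ranked[:n]


def n_best_log_likelihood(data, prior, emit, lists):
    """Log of the N-best list likelihood: states outside each list are ignored."""
    return sum(math.log(sum(prior[z] * emit[z][x] for z in kept))
               for x, kept in zip(data, lists))


def em_update(data, prior, emit, lists):
    """Step 2: one EM iteration restricted to the N-best states of each observation."""
    k, v = len(prior), len(emit[0])
    z_count = [0.0] * k
    zx_count = [[0.0] * v for _ in range(k)]
    for x, kept in zip(data, lists):
        joint = {z: prior[z] * emit[z][x] for z in kept}
        total = sum(joint.values())
        for z, j in joint.items():  # E-step: posterior over the kept states only
            z_count[z] += j / total
            zx_count[z][x] += j / total
    # M-step: closed-form normalization of expected counts.
    new_prior = [c / len(data) for c in z_count]
    new_emit = [[c / z_count[z] for c in zx_count[z]] if z_count[z] > 0 else list(emit[z])
                for z in range(k)]
    return new_prior, new_emit


def n_best_em(data, prior, emit, n, iterations=15):
    history = []
    for _ in range(iterations):
        lists = [n_best(x, prior, emit, n) for x in data]
        history.append(n_best_log_likelihood(data, prior, emit, lists))
        prior, emit = em_update(data, prior, emit, lists)
    return prior, emit, history


data = [0, 0, 1, 3, 3, 2, 0, 3, 1, 3]  # toy observations over a 4-symbol vocabulary
prior0 = [0.5, 0.3, 0.2]
emit0 = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]]
prior, emit, history = n_best_em(data, prior0, emit0, n=2)
print(history)
```

In this toy setting the recorded N-best list log-likelihood does not decrease from one iteration to the next: re-running the search can only raise the value for fixed parameters, and the restricted EM update can only raise it for fixed lists.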

We  use  Zangwill's  global  convergence  theorem  (Zangwill  1969)  to  analyze  the 
behavior  of  convergence  of  the  N-best  list  approximate  EM. 

First, we define two concepts needed for Zangwill's global convergence theorem.
A map $\mathcal{M}$ from points of $\Theta$ to subsets of $\Theta$ is called a point-to-set map on $\Theta$. It
is said to be closed at $\theta$ if $\theta_i \to \theta$, $\theta_i \in \Theta$ and $\lambda_i \to \lambda$, $\lambda_i \in \mathcal{M}(\theta_i)$ implies $\lambda \in \mathcal{M}(\theta)$.
For a point-to-point map, continuity implies closedness. Then the global convergence
theorem (Zangwill 1969) states the following.


Theorem

Let $\mathcal{M}$ be a point-to-set map (an algorithm) that, given a point $\theta_0 \in \Theta$, generates a
sequence $\{\theta_i\}_{i=0}^{\infty}$ through the iteration $\theta_{i+1} = \mathcal{M}(\theta_i)$. Let $\Omega \subset \Theta$ be the set of fixed points
of $\mathcal{M}$. Suppose (i) $\mathcal{M}$ is closed over the complement of $\Omega$; (ii) there is a continuous
function $\phi$ on $\Theta$ such that (a) if $\theta \notin \Omega$, $\phi(\lambda) > \phi(\theta)$ for all $\lambda \in \mathcal{M}(\theta)$, and (b) if $\theta \in \Omega$,
$\phi(\lambda) \geq \phi(\theta)$ for all $\lambda \in \mathcal{M}(\theta)$.

Then all the limit points of $\{\theta_i\}$ are in $\Omega$ and $\phi(\theta_i)$ converges monotonically to $\phi(\theta)$
for some $\theta \in \Omega$.


Proof 

This  theorem  has  been  used  by  Wu  (1983)  to  prove  the  convergence  of  a  standard  EM 
algorithm  (Dempster,  Laird,  and  Rubin  1977).  We  now  use  this  theorem  to  show  that 
the  N-best  list  approximate  EM  algorithm  globally  converges  to  the  stationary  points 
of  the  N-best  list  likelihood.  We  encounter  one  difficulty  at  this  point,  however,  due  to 
the  maximization  operator  in  Equation  (11);  after  each  iteration  the  N-best  list  may  have 
been  changed,  therefore  the  set  of  data  presented  for  the  estimation  of  model  parameters 
may  be  different  from  the  previous  one.  Nevertheless,  we  prove  the  convergence  of  the 
N-best  list  approximate  EM  algorithm  by  checking  whether  it  satisfies  two  conditions 
in  Zangwill's  global  convergence  theorem.  Because  the  composite  model  is  essentially 
a  mixture  model  of  a  curved  exponential  family  through  a  complex  hierarchy,  there 
is a closed form solution for the $Q(p', p, \mathcal{T}_N)$ function irrespective of the N-best list
parse trees, so the N-best list approximate EM algorithm is a one-to-one map. Because
$Q(p', p, \mathcal{T}_N)$ is continuous in both $p'$ and $p$, the map is closed, thus condition (i) is
satisfied.

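To check condition (ii), we need to verify that the N-best list likelihood as a function
of $p$ satisfies the properties of $\phi(\theta)$ in condition (ii). Let $\mathcal{T}_N$ and $\tilde{\mathcal{T}}_N$ be the two collections
of N-best list parse trees for sentences over the entire corpus $\mathcal{D}$ under two model parameters
$p$ and $\tilde{p}$, respectively:

$$\mathcal{T}_N = \arg\max_{\mathcal{T}'_N} \mathcal{L}(\mathcal{D}, p, \mathcal{T}'_N) \tag{12}$$

$$\tilde{\mathcal{T}}_N = \arg\max_{\mathcal{T}'_N} \mathcal{L}(\mathcal{D}, \tilde{p}, \mathcal{T}'_N) \tag{13}$$

and let $\tilde{p}$ be the closed form solution of maximizing $Q(p', p, \mathcal{T}_N)$ with respect to $p'$, that is,

$$\tilde{p} = \arg\max_{p'} Q(p', p, \mathcal{T}_N) \tag{14}$$

Then

\begin{align}
\max_{\mathcal{T}'_N} \mathcal{L}(\mathcal{D}, \tilde{p}, \mathcal{T}'_N) &\geq \mathcal{L}(\mathcal{D}, \tilde{p}, \mathcal{T}_N) \tag{15} \\
&\geq \mathcal{L}(\mathcal{D}, p, \mathcal{T}_N) \tag{16} \\
&\geq \max_{\mathcal{T}'_N} \mathcal{L}(\mathcal{D}, p, \mathcal{T}'_N) \tag{17}
\end{align}

The inequality in Equation (15) is strict unless $\tilde{\mathcal{T}}_N = \mathcal{T}_N$, which results in $p \in \mathcal{M}(p)$.
Using results proven by Wu (1983), we know that when $p$ is not a stationary point of the
N-best list likelihood or $p \notin \mathcal{M}(p)$,

$$\frac{\partial \mathcal{L}(\mathcal{D}, p, \mathcal{T}_N)}{\partial p} = \left. \frac{\partial Q(p', p, \mathcal{T}_N)}{\partial p'} \right|_{p' = p} \neq 0, \qquad Q(\tilde{p}, p, \mathcal{T}_N) > Q(p, p, \mathcal{T}_N),$$

thus the inequality in Equation (16) is strict. Finally, the inequality in Equation (17) is
strict unless $p \in \mathcal{M}(p)$. Thus condition (ii) is satisfied.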
This completes the proof that the N-best list approximate EM algorithm monotonically
increases the N-best list likelihood and converges in the sense of Zangwill's
global convergence. ■

In the following, we formally derive the N-best list approximate EM algorithm, whose
time complexity is linear in sentence length.

3.1.1 N-best List Search Strategy. For each sentence W in document d, instead of scanning
all the hidden events (both allowed parse trees and semantic annotation strings), we
restrict the algorithm to operate with N-best hidden events. We find that, for each
document, a large number of topics should be pruned and only a small set of allowed
topics kept, out of consideration for both computational time and resource demand;
otherwise we would have to use many more machines to store WORD-PREDICTOR's
parameters.

We can find the N-best parses for each sentence and the N-best topics for each
document either simultaneously or separately. The latter is much preferred, because the
former is much more computationally expensive.

To extract the N-best topics, we run an EM algorithm for a PLSA model on training
corpus $\mathcal{D}$, then keep the N most likely topics (denoted as $\mathcal{G}_d$) according to the values of
$p(g|d)$; the rest of the topics are purged.
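The topic pruning step can be illustrated with a minimal sketch. It assumes the PLSA posteriors p(g|d) for one document are already available as a dictionary; the function name `prune_topics` and the toy topic labels are hypothetical, and the renormalization mirrors the normalization of the SEMANTIZER over the kept topics described later in this section.

```python
def prune_topics(p_g_given_d, n_best):
    """Keep the n_best most probable topics of one document and renormalize.

    p_g_given_d: dict mapping a topic id to p(g|d) for a single document d
                 (assumed to come from a previously trained PLSA model).
    """
    # Rank topic ids by p(g|d), highest first, and keep the top n_best.
    kept = sorted(p_g_given_d, key=p_g_given_d.get, reverse=True)[:n_best]
    total = sum(p_g_given_d[g] for g in kept)
    # Renormalize so the pruned distribution is a proper distribution.
    return {g: p_g_given_d[g] / total for g in kept}


# Toy document with four topics; only the two most likely survive.
doc_topics = {"g0": 0.50, "g1": 0.30, "g2": 0.15, "g3": 0.05}
pruned = prune_topics(doc_topics, n_best=2)
```

Purged topics simply disappear from the dictionary, which is one way to realize the saving in WORD-PREDICTOR parameters that motivates the pruning.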

To extract the N-best parse trees, we adopt a synchronous, multi-stack search
strategy that is similar to the one in Chelba and Jelinek (1998, 2000) and Chelba
(2000). It maintains a set of stacks that store the most likely partial parses
for a given prefix $W_k$; the less probable parses are purged. Each stack contains
hypotheses (partial parses) that have been constructed by the same number of
WORD-PREDICTOR and the same number of CONSTRUCTOR operations. The hypotheses in
each stack are ranked according to the $\log P_p(W_k, T_k|d)$ score with the highest on top,
where $P_p(W_k, T_k|d) = \sum_{G_k} P_p(W_k, T_k, G_k|d)$ and $W_k, T_k, G_k$ denote the joint sequence
of prefix $W_k = w_0, \cdots, w_k$ with its parse structure $T_k$ and semantic annotation string
$G_k = g_1, \cdots, g_k$, $g_i \in \mathcal{G}_d$, $i = 1, \cdots, k$, in document d. This sequence is produced by a
unique sequence of model actions: WORD-PREDICTOR, TAGGER, CONSTRUCTOR,
and SEMANTIZER moves. Its probability is obtained by chaining the probabilities of
these moves. The value of $P_p(W_k, T_k|d)$ is computed recursively from $P_p(W_{k-1}, T_{k-1}|d)$
by the following formula:


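\[
P_p(W_k, T_k|d) = P_p(W_{k-1}, T_{k-1}|d)
\left( \sum_{g_k \in \mathcal{G}_d} p(w_k | w_{k-n+1}^{k-1} h_{-m}^{-1} g_k)
\frac{p(g_k|d)}{\sum_{g_i \in \mathcal{G}_d} p(g_i|d)} \right)
p(t_k | w_k, h_{-m}^{-1}.\mathrm{tag}) \,
p(T_{k-1,k} | W_{k-1}T_{k-1}, w_k, t_k)
\tag{18}
\]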


where $W_{k-1}T_{k-1}$ is the word-parse $(k-1)$-prefix; $w_k$ is the kth word predicted by
WORD-PREDICTOR; $t_k$ is the tag assigned to $w_k$ by the TAGGER; $T_{k-1,k}$ is the incremental
parse structure that generates $T_k = T_{k-1} \| T_{k-1,k}$ when attached to $T_{k-1}$ (this
is the parse structure built on top of $T_{k-1}$ and the newly predicted word $w_k$); the $\|$
notation stands for concatenation. Finally, $p(T_{k-1,k} | W_{k-1}T_{k-1}, w_k, t_k)$ is the product of
the probabilities of a series of CONSTRUCTOR moves in $T_{k-1,k}$ to form $T_k$. Because the
topics are pruned to $\mathcal{G}_d$, the probability of the SEMANTIZER is normalized to ensure a
proper probability distribution. A stack vector consists of the ordered set of stacks
containing partial parses with the same number of WORD-PREDICTOR operations but a
different number of CONSTRUCTOR operations. In WORD-PREDICTOR and TAGGER
operations, some hypotheses are discarded due to the maximum number of hypotheses
that the stack can contain at any given time. In the CONSTRUCTOR operation, the
resulting hypotheses are discarded due to either finite stack size or the log-probability
threshold (the maximum tolerable difference between the log-probability score of the
top-most hypothesis and the bottom-most hypothesis at any given state of the stack).
The synchronous, multi-stack search strategy is a greedy best-first search algorithm,
one of the local heuristic search procedures that does not use future cost estimates
to guide the search and thus does not guarantee that the N-best list parse trees are a
global optimal solution (Russell and Norvig 2010). In practice, however, we find that
the N-best list approximate EM algorithm does converge within several iterations.
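The two pruning criteria applied to a stack (finite stack size and the log-probability threshold) can be sketched as follows. This is only an illustration under simplifying assumptions: a hypothesis is reduced to a `(log_prob, parse)` pair, and the name `prune_stack` and its parameters are hypothetical rather than taken from the SLM implementation.

```python
def prune_stack(hypotheses, max_size, log_prob_threshold):
    """Return the surviving hypotheses of one stack, best first.

    hypotheses: list of (log_prob, parse) pairs, where log_prob stands for
                the log P(W_k, T_k | d) score used to rank the stack.
    max_size: maximum number of hypotheses the stack may hold.
    log_prob_threshold: maximum tolerated gap between the top-most and the
                bottom-most hypothesis in the stack.
    """
    # Rank by score, highest on top, and enforce the finite stack size.
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)[:max_size]
    if not ranked:
        return ranked
    best = ranked[0][0]
    # Drop hypotheses that fall too far below the top-most one.
    return [h for h in ranked if best - h[0] <= log_prob_threshold]


stack = [(-2.0, "parse_a"), (-9.5, "parse_b"), (-2.7, "parse_c"), (-4.1, "parse_d")]
survivors = prune_stack(stack, max_size=3, log_prob_threshold=2.0)
```

Because hypotheses are discarded without any estimate of their future cost, a parse that would later become the best one can be lost, which is the reason the search is not guaranteed to return a globally optimal N-best list.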

3.1.2 EM Update. Once we have both the N-best parse trees for each sentence in
document d and the N-best topics for document d, we derive the EM algorithm to estimate
model parameters.

Maximizing $Q(p', p, \mathcal{T}_N)$ with respect to $p'$ leads to re-estimated parameters of the
composite model, which are nothing but the following normalized conditional expected
counts:


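\begin{align}
p'(w | w_{-n+1}^{-1} h_{-m}^{-1} g) &\propto \sum_{d \in \mathcal{D}} \sum_{l} \sum_{G^l} \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N}
P_p(T^l, G^l | W^l, d) \, \#(w_{-n+1}^{-1} w h_{-m}^{-1} g, W^l, T^l, G^l, d) \tag{19} \\
p'(t | w h_{-m}^{-1}.\mathrm{tag}) &\propto \sum_{d \in \mathcal{D}} \sum_{l} \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N}
P_p(T^l | W^l, d) \, \#(t w h_{-m}^{-1}.\mathrm{tag}, W^l, T^l, d) \tag{20} \\
p'(a | h_{-m}^{-1}) &\propto \sum_{d \in \mathcal{D}} \sum_{l} \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N}
P_p(T^l | W^l, d) \, \#(a h_{-m}^{-1}, W^l, T^l, d) \tag{21} \\
p'(g | d) &\propto \sum_{l} \sum_{G^l} \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N}
P_p(T^l, G^l | W^l, d) \, \#(g, W^l, G^l, d) \tag{22}
\end{align}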

In the E-step, we use Equations (19)-(22) to compute the expected count of each
model parameter over sentence $W^l$ in document d in the training corpus $\mathcal{D}$. In the full
case where the number of parse trees grows faster than exponentially with sentence
length, we use Jelinek-style recursive formulas in the generalized inside-outside
algorithm (Jelinek 2004) to handle the tree structure and describe the weighted forest of
possible derivations (Wang et al. 2006). In the N-best list case considered in this paper,
however, we just enumerate each parse tree in the N-best list and compute the expected
posterior count for each parse tree. For the WORD-PREDICTOR and the SEMANTIZER,
we use Equations (19) and (22) and note that there is a sum over the semantic annotation
sequence $G^l$, where the number of possible semantic annotation sequences is exponential.
We use forward-backward recursive formulas reminiscent of those in hidden Markov
models to compute the expected counts. To be more specific, for each parse $T^l \in \mathcal{T}^l_N$, we
define the forward vector $\alpha^l(g|d)$ to be

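\[
\begin{aligned}
\alpha^l_{k+1}(g|d) &= \sum_{G^l_k} P_p(W^l_k, T^l_k, w^k_{k-n+2} w_{k+1} h_{-m}^{-1} g, G^l_k | d) \\
&= P_p(W^l_k, T^l_k, w^k_{k-n+2} w_{k+1} h_{-m}^{-1} g | d) \\
&= P_p(W^l_k, T^l_k | d) \, p(w_{k+1} | w^k_{k-n+2} h_{-m}^{-1} g, d)
\frac{p(g|d)}{\sum_{g_i \in \mathcal{G}_d} p(g_i|d)}
\end{aligned}
\tag{23}
\]

where $W^l_k$ is the word k-prefix for sentence $W^l$, and $T^l_k$ is the parse for the k-prefix. It is easy
to see that the forward vector $\alpha^l(g|d)$ can be recursively computed in a forward manner
using Equation (18) as

\[
\alpha^l_{k+1}(g|d) = \left( \sum_{g_k \in \mathcal{G}_d} \alpha^l_k(g_k|d) \right)
p(t_k | w_k, h_{-m}^{-1}.\mathrm{tag}) \, p(T^l_{k-1,k} | W^l_{k-1}T^l_{k-1}, w_k, t_k) \,
p(w_{k+1} | w^k_{k-n+2} h_{-m}^{-1} g, d) \frac{p(g|d)}{\sum_{g_i \in \mathcal{G}_d} p(g_i|d)}
\tag{24}
\]

We define the backward vector $\beta^l(g|d)$ to be

\[
\beta^l_{k+1}(g|d) = \sum_{G^l_{k+1,\cdot}} P_p(W^l_{k+1,\cdot}, T^l_{k+1,\cdot}, G^l_{k+1,\cdot} | w^k_{k-n+2} w_{k+1} h_{-m}^{-1} g, d)
\tag{25}
\]

where $W^l_{k+1,\cdot} = w^l_{k+2}, \cdots, \langle/s\rangle$ is the subsequence after word $w^l_{k+1}$ in sentence $W^l$, $T^l_{k+1,\cdot}$
is the incremental parse structure after the parse structure $T^l_{k+1}$ of word $(k+1)$-prefix
$W^l_{k+1}$ that generates parse tree $T^l$, $T^l = T^l_{k+1} \| T^l_{k+1,\cdot}$, and $G^l_{k+1,\cdot} = g_{k+2}, \cdots$ is the semantic
subsequence in $G^l$ relevant to $W^l_{k+1,\cdot}$. Again it is easy to see that the backward vector
$\beta^l(g|d)$ can be recursively computed in a backward manner as

\[
\begin{aligned}
\beta^l_{k+1}(g|d) &= p(t_{k+1} | w_{k+1}, h_{-m}^{-1}.\mathrm{tag}) \, p(T^l_{k,k+1} | W^l_k T^l_k, w_{k+1}, t_{k+1}) \\
&\quad \times \sum_{g_{k+2} \in \mathcal{G}_d} \sum_{G^l_{k+2,\cdot}}
p(w_{k+2} | w^{k+1}_{k-n+3} h_{-m}^{-1} g_{k+2}, d) \frac{p(g_{k+2}|d)}{\sum_{g_i \in \mathcal{G}_d} p(g_i|d)} \,
P_p(W^l_{k+2,\cdot}, T^l_{k+2,\cdot}, G^l_{k+2,\cdot} | w^{k+1}_{k-n+3} w_{k+2} h_{-m}^{-1} g_{k+2}, d) \\
&= p(t_{k+1} | w_{k+1}, h_{-m}^{-1}.\mathrm{tag}) \, p(T^l_{k,k+1} | W^l_k T^l_k, w_{k+1}, t_{k+1}) \\
&\quad \times \sum_{g_{k+2} \in \mathcal{G}_d}
p(w_{k+2} | w^{k+1}_{k-n+3} h_{-m}^{-1} g_{k+2}, d) \frac{p(g_{k+2}|d)}{\sum_{g_i \in \mathcal{G}_d} p(g_i|d)} \,
\beta^l_{k+2}(g_{k+2}|d)
\end{aligned}
\tag{26}
\]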

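Then, the expected count of $w_{-n+1}^{-1} w h_{-m}^{-1} g$ for the WORD-PREDICTOR on sentence $W^l$
in document d is

\[
\begin{aligned}
&\sum_{G^l} \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N} P_p(T^l, G^l | W^l, d) \, \#(w_{-n+1}^{-1} w h_{-m}^{-1} g, W^l, T^l, G^l, d) \\
&= \sum_{G^l} \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N} P_p(T^l, G^l, W^l | d) \, \#(w_{-n+1}^{-1} w h_{-m}^{-1} g, W^l, T^l, G^l, d) / P_p(W^l|d) \\
&= \sum_{l} \sum_{k} \alpha^l_{k+1}(g|d) \, \beta^l_{k+1}(g|d) \,
\delta(w^k_{k-n+2} w_{k+1} h_{-m}^{-1} g_{k+1} = w_{-n+1}^{-1} w h_{-m}^{-1} g) / P_p(W^l|d)
\end{aligned}
\tag{27}
\]

where $P_p(W^l|d) = \sum_{G^l} \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N} P_p(T^l, G^l, W^l|d) = \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N} P_p(T^l, W^l|d)$, $P_p(T^l, W^l|d)$
is recursively computed by Equation (18) through traversing the lth parse tree $T^l \in \mathcal{T}^l_N$
of sentence $W^l$ from left to right, and $\delta(\cdot)$ is an indicator function. The expected count
of $g$ for the SEMANTIZER on sentence $W^l$ in document d is

\[
\begin{aligned}
&\sum_{G^l} \sum_{T^l \in \mathcal{T}^l_N \in \mathcal{T}_N} P_p(T^l, G^l | W^l, d) \, \#(g, W^l, G^l, d) \\
&= \sum_{l} \sum_{k} \alpha^l_{k+1}(g|d) \, \beta^l_{k+1}(g|d) \, p(w_{k+1} | w^k_{k-n+2} h_{-m}^{-1} g) / P_p(W^l|d)
\end{aligned}
\tag{28}
\]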

For the TAGGER and the CONSTRUCTOR, we use Equations (20) and (21), and the
expected count of each event of $t w h_{-m}^{-1}.\mathrm{tag}$ and $a h_{-m}^{-1}$ over parse $T^l$ of sentence $W^l$ in
document d is the real count appearing in parse tree $T^l$ of sentence $W^l$ in document d
times the conditional distribution $P_p(T^l|W^l, d) = P_p(T^l, W^l|d) / \sum_{T^l \in \mathcal{T}^l_N} P_p(T^l, W^l|d)$, that
is, $P_p(T^l|W^l, d) \, \#(t w h_{-m}^{-1}.\mathrm{tag}, W^l, T^l, d)$ and $P_p(T^l|W^l, d) \, \#(a h_{-m}^{-1}, W^l, T^l, d)$, respectively.

When only SLM is considered, the expected count for each model component,
WORD-PREDICTOR, TAGGER, and CONSTRUCTOR, over parse $T^l$ of sentence $W^l$ in
document d is the real count that appeared in parse $T^l$ of sentence $W^l$ in document d
times the posterior probability $P_p(T^l|W^l, d)$, as is done in Chelba and Jelinek (1998, 2000)
and Chelba (2000).
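The posterior-weighted counting over an N-best list can be sketched in a few lines. The sketch assumes each parse is summarized by its joint log-probability and by a counter of the events it contains; the event tuples and the function name `expected_counts` are hypothetical stand-ins for the TAGGER and CONSTRUCTOR events.

```python
import math
from collections import Counter


def expected_counts(nbest):
    """Posterior-weighted event counts over the N-best parses of one sentence.

    nbest: list of (log_joint, events) pairs, where log_joint stands for
           log P(T, W | d) of a parse and events is a Counter with the real
           counts of the events observed in that parse.
    """
    log_joint = [lp for lp, _ in nbest]
    # Log-sum-exp gives the normalizer: the sum of P(T, W | d) over the list.
    m = max(log_joint)
    log_norm = m + math.log(sum(math.exp(lp - m) for lp in log_joint))
    expected = Counter()
    for lp, events in nbest:
        posterior = math.exp(lp - log_norm)  # P(T | W, d) within the list
        for event, count in events.items():
            expected[event] += posterior * count
    return expected


# Two toy parses that disagree on the tag of "dog" but share one move.
nbest = [
    (math.log(0.006), Counter({("NN", "dog"): 1, ("adjoin-left", "the", "dog"): 1})),
    (math.log(0.002), Counter({("VB", "dog"): 1, ("adjoin-left", "the", "dog"): 1})),
]
counts = expected_counts(nbest)
```

An event shared by every parse in the list keeps its full count, while events on which the parses disagree are split according to the parse posteriors.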

In the M-step, the recursive linear interpolation scheme (Jelinek and Mercer 1980)
is used to obtain a smooth probability estimate for each model component
(WORD-PREDICTOR, TAGGER, and CONSTRUCTOR). The TAGGER and CONSTRUCTOR are
conditional probabilistic models of the type $p(u | z_1, \cdots, z_n)$ where $u, z_1, \cdots, z_n$ belong to
a mixed set of words, POS tags, NT tags, and CONSTRUCTOR actions ($u$ only); and
$z_1, \cdots, z_n$ form a linear Markov chain. The recursive mixing scheme is the standard one
among relative frequency estimates of different orders $k = 0, \cdots, n$ and has been
explained in Chelba and Jelinek (1998, 2000) and Chelba (2000). The WORD-PREDICTOR
is, however, a conditional probabilistic model $p(w | w_{-n+1}^{-1} h_{-m}^{-1} g)$ where there are three
kinds of context, $w_{-n+1}^{-1}$, $h_{-m}^{-1}$, and $g$, each of which forms a linear Markov chain. The model
has a combinatorial number of relative frequency estimates of different orders among
three linear Markov chains. We generalize Jelinek and Mercer's (1980) original
recursive mixing scheme to handle the situation where the context is a mixture of Markov
chains. The factored language (FL) model (Bilmes and Kirchhoff 2003) is close to the
smoothing technique we propose here. The major difference is that FL considers all
possible combinations of the context of a conditional probability that can be concisely
represented by a factor graph, whereas our approach strictly respects the order of
Markov chains for word sequence and headword sequence because we believe natural
language tightly follows these orders; moreover, where FL uses a backoff technique,
we use linear interpolation.

Figure 3
Recursive linear interpolation lattice to estimate WORD-PREDICTOR $p(w | w_{-2} w_{-1} h_{-2} h_{-1} g)$ of
the composite trigram/2-SLM/PLSA language model, where $\mathcal{U}$ is the vocabulary in which the
predicted random variable $w$ takes values and $p(\mathcal{U})$ denotes the uniform distribution over $\mathcal{U}$. The
lattice is formed by three linear Markov chains, $w_{-2} w_{-1}$, $h_{-2} h_{-1}$, and $g$. Starting from $p(\mathcal{U})$,
each vertex is visited in a bottom-up, back to front, and right to left order.

Consider a composite trigram/2-SLM/PLSA language model. Figure 3 illustrates
a lattice formed of all possible conditional probabilistic models and relative frequency
estimates of different orders along each of the three linear Markov chains. Each vertex
in the lattice represents a conditional probabilistic model that is a linear interpolation of
vertices having directed arcs pointing to this vertex and its relative frequency estimate;
the linear interpolation coefficients are the weights of directed arcs. For example, the
WORD-PREDICTOR $p(w | w_{-2} w_{-1} h_{-2} h_{-1} g)$ is a linear interpolation of three conditional
probabilistic models, $p(w | w_{-1} h_{-2} h_{-1} g)$, $p(w | w_{-2} w_{-1} h_{-1} g)$, $p(w | w_{-2} w_{-1} h_{-2} h_{-1})$, and
their relative frequency estimate $f(w | w_{-2} w_{-1} h_{-2} h_{-1} g)$,

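\[
\begin{aligned}
p(w | w_{-2} w_{-1} h_{-2} h_{-1} g) ={}& \lambda_w(w_{-2} w_{-1} h_{-2} h_{-1} g) \cdot p(w | w_{-1} h_{-2} h_{-1} g) \\
&+ \lambda_h(w_{-2} w_{-1} h_{-2} h_{-1} g) \cdot p(w | w_{-2} w_{-1} h_{-1} g) \\
&+ \lambda_g(w_{-2} w_{-1} h_{-2} h_{-1} g) \cdot p(w | w_{-2} w_{-1} h_{-2} h_{-1}) \\
&+ \bigl(1 - \lambda_w(w_{-2} w_{-1} h_{-2} h_{-1} g) - \lambda_h(w_{-2} w_{-1} h_{-2} h_{-1} g) \\
&\qquad - \lambda_g(w_{-2} w_{-1} h_{-2} h_{-1} g)\bigr) \cdot f(w | w_{-2} w_{-1} h_{-2} h_{-1} g)
\end{aligned}
\tag{29}
\]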


where $\lambda_w(w_{-2} w_{-1} h_{-2} h_{-1} g)$, $\lambda_h(w_{-2} w_{-1} h_{-2} h_{-1} g)$, and $\lambda_g(w_{-2} w_{-1} h_{-2} h_{-1} g)$ are
non-negative context-dependent interpolation coefficients whose sum is less than 1;
$f(w | w_{-2} w_{-1} h_{-2} h_{-1} g) = C(w_{-2} w_{-1} w h_{-2} h_{-1} g) / C(w_{-2} w_{-1} h_{-2} h_{-1} g)$, where
$C(w_{-2} w_{-1} w h_{-2} h_{-1} g)$ is the expected count of the event $w_{-2} w_{-1} w h_{-2} h_{-1} g$ that is
extracted from the training corpus by the E-step of the N-best approximate EM algorithm, and
$C(w_{-2} w_{-1} h_{-2} h_{-1} g) = \sum_{w \in \mathcal{U}} C(w_{-2} w_{-1} w h_{-2} h_{-1} g)$. The linear interpolation
coefficients are grouped into
equivalence classes (tied) based on the range into which the count falls; the count ranges
for each equivalence class, "buckets," are set such that a statistically sufficient number
of events fall within that range. In our experiments, we set the count ranges to be the
intervals of $2^i$, $i = 0, 1, \cdots, 10$ (i.e., 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and $\infty$).
These "tied" interpolation weights are determined by the maximum likelihood estimate
from cross-validation data through the EM algorithm (Dempster, Laird, and Rubin 1977).
To this end, we use a publicly available parser in the openNLP software¹ to parse the
sentences in the cross-validation data, run LSA to extract the N most likely topics for each
document in the cross-validation data, and then gather the joint counts for each model
component (WORD-PREDICTOR, TAGGER, and CONSTRUCTOR) that are used to determine the
interpolation weights.

In  the  M-step,  assuming  that  the  count  ranges  and  the  corresponding  interpolation 
values  for  each  order  are  kept  fixed  to  their  initial  values,  the  only  parameters  to  be 
re-estimated  using  the  EM  algorithm  are  the  maximal  order  counts  for  each  model 
component.  The  interpolation  scheme  outlined  here  is  then  used  to  obtain  a  smooth 
probability  estimate  for  each  model  component. 
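The tying of interpolation weights to count buckets and one interpolation step of the kind shown in Equation (29) can be sketched as below. The sketch makes several assumptions that are not stated in the text: each bucket is taken to be closed on the right (a count of exactly 4 falls in the bucket ending at 4), counts of zero get their own bucket, and the weight values are made up for illustration. The names `bucket_index` and `interpolate` are hypothetical.

```python
import bisect

# Bucket edges 1, 2, 4, ..., 1024; counts above 1024 share the last bucket.
BUCKET_EDGES = [2 ** i for i in range(11)]


def bucket_index(count):
    """Map a (possibly fractional) context count to its tied-weight bucket."""
    if count <= 0:
        return 0  # assumed: unseen contexts form a bucket of their own
    return 1 + bisect.bisect_left(BUCKET_EDGES, count)


def interpolate(lower_order_probs, lambdas, rel_freq):
    """Mix lower-order estimates with the relative frequency estimate."""
    if any(lam < 0.0 for lam in lambdas) or sum(lambdas) > 1.0:
        raise ValueError("weights must be non-negative and sum to at most 1")
    mix = sum(lam * p for lam, p in zip(lambdas, lower_order_probs))
    return mix + (1.0 - sum(lambdas)) * rel_freq


# Illustrative tied weights (lambda_w, lambda_h, lambda_g), one triple per bucket.
tied_lambdas = {b: (0.2, 0.3, 0.1) for b in range(13)}
lam = tied_lambdas[bucket_index(37.5)]
p = interpolate([0.10, 0.20, 0.05], lam, rel_freq=0.40)
```

Tying by bucket keeps the number of free weights small, so they can be estimated reliably from the cross-validation data even for contexts that are rare.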

3.2  Follow-up  EM 

As explained in Chelba and Jelinek (2000) and Chelba (2000), for the SLM component
a large fraction of the partial parse trees that can be used for assigning probability to
the next word do not survive in the synchronous, multi-stack search strategy, thus they
are not used in the N-best approximate EM algorithm for the estimation of
WORD-PREDICTOR to improve its predictive power. To remedy this weakness, we estimate a
separate WORD-PREDICTOR (and SEMANTIZER) model using the partial parse trees
exploited by the synchronous, multi-stack search strategy.

¹ http://www.codeproject.com/KB/recipes/englishparsing.aspx.

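First, we look at how to compute the language model probability assignment for the
word at position k+1 in the input sentence of document d when the word-parse k-prefix
$W_k T_k$ is available. From the causal relationship among the parameters of the composite
n-gram/m-SLM/PLSA, we have

\[
\begin{aligned}
P_p(w_{k+1} | W_k, d) &= \sum_{T_k \in \mathcal{Z}_k, \, g_{k+1} \in \mathcal{G}_d} P_p(w_{k+1}, T_k, g_{k+1} | W_k, d) \\
&= \sum_{T_k \in \mathcal{Z}_k, \, g_{k+1} \in \mathcal{G}_d} P_p(w_{k+1} | W_k, T_k, g_{k+1}, d) \, P_p(T_k | W_k, d)
\frac{p(g_{k+1}|d)}{\sum_{g_i \in \mathcal{G}_d} p(g_i|d)} \\
&= \sum_{h_{-m}^{-1} \in T_k; \, T_k \in \mathcal{Z}_k, \, g_{k+1} \in \mathcal{G}_d} p(w_{k+1} | w^k_{k-n+2} h_{-m}^{-1} g_{k+1}) \, P_p(T_k | W_k, d)
\frac{p(g_{k+1}|d)}{\sum_{g_i \in \mathcal{G}_d} p(g_i|d)}
\end{aligned}
\tag{30}
\]

where $P_p(T_k | W_k, d) = \frac{\sum_{G_k} P_p(W_k, T_k, G_k|d)}{\sum_{T_k \in \mathcal{Z}_k} \sum_{G_k} P_p(W_k, T_k, G_k|d)}$ to ensure a proper probability
normalization over word strings $W_k$; $\mathcal{Z}_k$ is the set of all parses present in the stacks at the current
stage k during the synchronous multi-stack pruning strategy and it is a function of
the word k-prefix $W_k = w_0, \cdots, w_k$; $G_k = g_1, \cdots, g_k$, $g_i \in \mathcal{G}_d$, $i = 1, \cdots, k$, is the semantic
string up to k; and $P_p(W_k, T_k, G_k|d)$ is the joint probability of word-parse k-prefix $W_k T_k$
and its semantic string $G_k$ in a document d.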

The likelihood of a training corpus $\mathcal{D}$ under this language model probability
assignment that uses partial parse trees generated during the process of the synchronous,
multi-stack search strategy can be written as

\[ \mathcal{L}(\mathcal{D}, p) = \prod_{d \in \mathcal{D}} \prod_{l} P_p(W^l | d) \tag{31} \]

where $P_p(W^l|d) = \prod_k P_p(w_{k+1} | W_k, d)$ and $W^l$ is the lth sentence in document d. Again,
similar to Equation (8), we ignore the ancillary term $p(d)$ in Equation (31).

We use a second stage of parameter re-estimation for $p(w_{k+1} | w^k_{k-n+2} h_{-m}^{-1} g_{k+1})$
and $p(g_{k+1}|d)$ by maximizing Equation (31) to improve WORD-PREDICTOR's
predictive power. In this case, the estimation of the WORD-PREDICTOR is for
the emission probability of a hidden Markov model with fixed transition
probabilities (although dependent on the position k in the input sentence) specified by the
$P_p(T_k | W_k, d) \, p(g_{k+1}|d) / \sum_{g_i \in \mathcal{G}_d} p(g_i|d)$ values. We use EM again. The E-step is to gather expected
joint counts $C(w^k_{k-n+2} w_{k+1} h_{-m}^{-1} g_{k+1}, d)$ and $C(g_{k+1}, d)$ of the WORD-PREDICTOR
model by accumulating each count at position k weighted by a posterior probability
$P_p(T_k, g_{k+1} | w_{k+1}, W_k, d)$, namely,

\[
P_p(T_k, g_{k+1} | w_{k+1}, W_k, d) =
\frac{p(w_{k+1} | w^k_{k-n+2} h_{-m}^{-1} g_{k+1}) \, p(g_{k+1}|d) \, P_p(T_k | W_k, d)}
{\sum_{h_{-m}^{-1} \in T_k; \, T_k \in \mathcal{Z}_k, \, g \in \mathcal{G}_d} p(w_{k+1} | w^k_{k-n+2} h_{-m}^{-1} g) \, p(g|d) \, P_p(T_k | W_k, d)}
\]


The  M-step  uses  the  same  count  smoothing  technique  as  that  described  in  the  N-best 
list  approximate  EM. 
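The posterior used as the weight in this E-step can be sketched as follows, for a single position k. The sketch assumes the three factors are supplied as plain dictionaries; the function name `posterior_weights` and the toy values are hypothetical, and each parse is taken to expose exactly one headword context.

```python
def posterior_weights(p_word, p_topic, p_parse):
    """Posterior over (parse, topic) pairs given the observed next word.

    p_word[(t, g)]: word-predictor probability of the observed next word
                    given the n-gram context, the headwords exposed by
                    parse t, and topic g.
    p_topic[g]:     p(g | d) over the kept topics.
    p_parse[t]:     P(T_k = t | W_k, d) over the parses left in the stacks.
    """
    joint = {(t, g): p_word[(t, g)] * p_topic[g] * p_parse[t] for (t, g) in p_word}
    z = sum(joint.values())
    # Normalize over all parses and topics at this position.
    return {key: value / z for key, value in joint.items()}


p_parse = {"t1": 0.6, "t2": 0.4}
p_topic = {"g1": 0.7, "g2": 0.3}
p_word = {("t1", "g1"): 0.02, ("t1", "g2"): 0.01,
          ("t2", "g1"): 0.05, ("t2", "g2"): 0.01}
weights = posterior_weights(p_word, p_topic, p_parse)
```

Each weight is then added to the expected count of the event formed by the n-gram context, the headwords of that parse, the topic, and the observed word.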
3.3 Distributed Architecture

When using very large corpora to train our composite language model, the data and the
parameters cannot be stored together on a single machine, so we have to resort to
distributed computing. The topic of large-scale distributed language models is relatively
new, and existing work is restricted to n-grams only (Zhang, Hildebrand, and Vogel
2006; Brants et al. 2007; Emami, Papineni, and Sorensen 2007). Although all existing
research uses distributed architectures that follow the client-server paradigm, the actual
implementations differ. Zhang et al. (2006) and Emami et al. (2007) store
training corpora in suffix arrays such that one sub-corpus per server serves raw counts,
and test sentences are loaded in a client. This implies that when computing the language
model probability of a sentence in a client, all servers need to be contacted for each
n-gram request. The approach by Brants et al. (2007) follows a standard MapReduce
paradigm (Dean and Ghemawat 2004): The corpus is first divided and loaded into a
number of clients, and n-gram counts are collected at each client; then the n-gram counts
are mapped via hashing and stored in a number of servers, resulting in exactly one
server being contacted per n-gram when computing the language model probability
of a sentence.

We adopt an approach similar to that of Brants et al. (2007) and make it suitable
for performing iterations of the N-best list approximate EM algorithm (see Figure 4). The
corpus is divided and loaded into a number of clients. We use a publicly available
parser to parse the sentences in each client to get the initial counts for $w_{-n+1}^{-1} w h_{-m}^{-1} g$
(WORD-PREDICTOR), $t w h_{-m}^{-1}.\mathrm{tag}$ (TAGGER), and $a h_{-m}^{-1}$ (CONSTRUCTOR); this completes
the Map part. The counts for a particular $w_{-n+1}^{-1} w h_{-m}^{-1} g$ at different clients are then
summed up and stored in one of the servers by hashing through word $w_{-1}$, headword
$h_{-1}$, and its topic $g$. The counts for all $t w h_{-m}^{-1}.\mathrm{tag}$ and $a h_{-m}^{-1}$ at different clients are
summed up and stored in one of the servers; this completes the Reduce part. This
is the initialization of the N-best list approximate EM step. Each client then calls the
servers for parameters to perform a synchronous multi-stack search for each sentence
to get the N-best list parse trees. Again, the expected counts for a particular parameter of
$w_{-n+1}^{-1} w h_{-m}^{-1} g$, $t w h_{-m}^{-1}.\mathrm{tag}$, and $a h_{-m}^{-1}$ at the clients are computed, which finishes the Map
part. The expected counts of $w_{-n+1}^{-1} w h_{-m}^{-1} g$ are then summed up and stored in one of the
servers by hashing through word $w_{-1}$, headword $h_{-1}$, and its topic $g$, and the counts
for all $t w h_{-m}^{-1}.\mathrm{tag}$ and $a h_{-m}^{-1}$ at different clients are summed up and stored in one of
the servers, which finishes the Reduce part. The SEMANTIZER has document-specific
parameters, thus its EM iterative updates are performed at each of the local clients. We
repeat this procedure until convergence.

Figure 4
Distributed architecture is essentially a MapReduce paradigm: Clients store partitioned data and
perform the E-step: compute expected counts; this is "Map." Servers store parameters (counts)
for the M-step where counts of $w_{-n+1}^{-1} w h_{-m}^{-1} g$ are hashed by word $w_{-1}$, headword $h_{-1}$, and its
topic $g$ to evenly distribute these model parameters into servers as much as possible and counts
of $t w h_{-m}^{-1}.\mathrm{tag}$ and $a h_{-m}^{-1}$ are stored into one server; this is "Reduce."
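The routing of WORD-PREDICTOR counts to servers can be illustrated with a small single-process sketch. It is an assumption-laden illustration, not the system's code: servers are modeled as in-memory dictionaries, an event is represented as a tuple `(word_context, word, headword_context, topic)`, and CRC32 is chosen here only because it is a deterministic hash available in the standard library.

```python
import zlib
from collections import defaultdict


def server_for(w_prev, h_prev, topic, num_servers):
    """Pick the server that owns all events sharing (w_{-1}, h_{-1}, g)."""
    key = "\t".join((w_prev, h_prev, topic)).encode("utf-8")
    return zlib.crc32(key) % num_servers


def reduce_counts(client_counts, num_servers):
    """Sum the per-client expected counts into hashed server shards."""
    servers = [defaultdict(float) for _ in range(num_servers)]
    for counts in client_counts:
        for event, value in counts.items():
            w_ctx, _word, h_ctx, topic = event
            # The last word and last headword of the context select the shard.
            shard = server_for(w_ctx[-1], h_ctx[-1], topic, num_servers)
            servers[shard][event] += value
    return servers


event_a = (("the", "big"), "dog", ("saw", "big"), "g3")
event_b = (("a", "small"), "cat", ("fed", "small"), "g1")
client_counts = [{event_a: 0.5}, {event_a: 1.25, event_b: 1.0}]
servers = reduce_counts(client_counts, num_servers=4)
```

Because the shard depends only on $w_{-1}$, $h_{-1}$, and $g$, a client that needs the parameters of one context has to contact a single server, which is the property the hashing scheme is meant to provide.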

Similarly,  we  use  a  distributed  architecture  as  in  Figure  4  to  perform  the  follow-up 
EM  algorithm  to  re-estimate  WORD-PREDICTOR. 

4.  Using  the  Model  for  Testing 

When a language model is used in one-pass decoders of speech recognition and
phrase-based MT systems to guide the search, the search space is organized as a prefix
tree and operates left to right, thus we need to know the language model probability at
the word level given by Equation (30) one word at a time. Because a document of the
test data is not contained in the original training corpus, to compute the language model
probability assignment for word $w_{k+1}$ we use a "fold-in" heuristic approach similar
to the one used in Hofmann (2001): The parameters corresponding to SEMANTIZER,
$p(g|d)$, are re-estimated by maximizing the probability of the word subsequence seen so
far (that is, a pseudo-document $d_k = (W_k, S)$, where S is the set of previous sentences of
a document in test data) while holding the other parameters fixed. Wang et al. (2005b)
use on-line gradient ascent to re-estimate these parameters. We use three methods,
one-step on-line EM, on-line EM with fixed learning rate, and batch EM, to re-estimate these
parameters. Both one-step on-line EM and on-line EM with fixed learning rate use
Equation  (32)  with  y  set  to  -  j  ^  and  a  constant  0.2,  respectively. 

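\[
p(g | d_k) = \gamma \,
\frac{\sum_{h_{-m}^{-1} \in T_{k-1}; \, T_{k-1} \in \mathcal{Z}_{k-1}} p(w_k | w^{k-1}_{k-n+1} h_{-m}^{-1} g) \, p(g | d_{k-1}) \, P_p(T_{k-1} | W_{k-1}, d_{k-1})}
{\sum_{h_{-m}^{-1} \in T_{k-1}; \, T_{k-1} \in \mathcal{Z}_{k-1}, \, g_i \in \mathcal{G}_d} p(w_k | w^{k-1}_{k-n+1} h_{-m}^{-1} g_i) \, p(g_i | d_{k-1}) \, P_p(T_{k-1} | W_{k-1}, d_{k-1})}
+ (1 - \gamma) \, p(g | d_{k-1})
\tag{32}
\]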


The batch EM is the standard EM algorithm where we repeat the iterative procedure
until convergence. The initial values are set to $\sum_{d \in \mathcal{D}} p(g|d) \, \#(d) / |\mathcal{D}|$, where for the topics
that are purged we just plug in 0 for $p(g|d)$. $\#(d)$ is the number of words in document
d, $d \in \mathcal{D}$, and $|\mathcal{D}| = \sum_{d \in \mathcal{D}} \#(d)$ denotes the size of the training corpus (which is the total
number of words in the entire training corpus).
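One on-line EM update of the document topic weights can be sketched as below, under the reading that the update interpolates the topic posterior given the newly seen word with the previous estimate. The sketch assumes the sum over parses has already been folded into a per-topic likelihood; the names `online_em_step`, `word_likelihood`, and `rate` are hypothetical, and only the constant 0.2 comes from the text.

```python
def online_em_step(p_topic, word_likelihood, rate):
    """One on-line EM update of p(g | d_k) from p(g | d_{k-1}).

    p_topic[g]:         current estimate p(g | d_{k-1}).
    word_likelihood[g]: probability of the newly observed word under topic g,
                        assumed to be already summed over the surviving
                        parses weighted by their posteriors.
    rate:               learning rate; 0.2 is the fixed value reported for
                        on-line EM with fixed learning rate.
    """
    joint = {g: word_likelihood[g] * p_topic[g] for g in p_topic}
    z = sum(joint.values())
    # Interpolate the topic posterior for this word with the old estimate.
    return {g: rate * joint[g] / z + (1.0 - rate) * p_topic[g] for g in p_topic}


p_prev = {"g1": 0.5, "g2": 0.5}
likelihood = {"g1": 0.03, "g2": 0.01}
p_next = online_em_step(p_prev, likelihood, rate=0.2)
```

The update keeps the weights normalized, since both the posterior and the previous estimate sum to one, and it only uses words already seen, so the next-word distribution it feeds remains a proper distribution.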

When  we  use  Equation  (30)  to  compute  perplexity,  the  system  only  uses  information 
coming  from  previous  words  to  generate  a  topic  distribution,  which  then  is  used  to 
predict  the  next  word,  so  the  sum  over  all  next  words  is  1. 

We find that the perplexity results are sensitive to these three methods and the initial
values. For example, for batch EM, if we set initial values to be those obtained by using
the pseudo-document up to the previous word $d_{k-1} = (W_{k-1}, S)$ and trained by batch
EM, we obtain worse perplexity results. Table 8 in Section 6.2 gives perplexity results
that use these three methods to re-estimate the parameters of the SEMANTIZER, where
the on-line EM with fixed learning rate not only has the cheapest computational cost
but also leads to the highest perplexity reductions.


5.  Related  Work 

Besides the work by Wang et al. (2005b, 2006) that was discussed in the Introduction,
the closest work to ours is that by Khudanpur and Wu (2000), where the authors used
SLM and a word clustering model to extract relevant grammatical and semantic
features, then integrated these features with n-grams by a maximum conditional entropy
approach. Our composite language model is a generative model in which all features play
important roles in the EM iterations to allow maximal order events for WORD-PREDICTOR to
appear; in Khudanpur and Wu (2000), however, the counts for all events are fixed after
feature extraction from SLM and word clustering, and no new maximal order events
for WORD-PREDICTOR can be extracted, which potentially hinders the predictive
power of WORD-PREDICTOR. Moreover, the training algorithm in Khudanpur and Wu
is computationally expensive. Both methods use the first-stage N-best list approximate
EM to extract headwords, thus the complexity is of the same order at this stage; at the
second stage, however, where we use the follow-up EM, they use the maximum
entropy approach. The maximum entropy approach is more expensive, mainly in feature
expectation and normalization as well as optimization (such as iterative scaling or the
quasi-Newton method); ours is quite simple, consisting of expected relative frequency
estimates with proper smoothing.

The highest reported perplexity reductions are those by Goodman (2001), where the
author examines the techniques of caching, clustering, higher-order n-grams, skipping
models, and sentence-mixture models in various combinations (mainly linear
interpolation). The author compares to the baseline of a Katz smoothed trigram with no count cutoffs.
On a small training corpus with 100k tokens, a 50% perplexity reduction (1 bit
improvement) is obtained. On a larger corpus with 284 million tokens without punctuation,
the improvement declines to 38%; we assume that this improvement shrinks to 30%
when compared with 4-gram as the baseline.
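The correspondence between a relative perplexity reduction and an improvement in bits per word follows from perplexity being 2 raised to the per-word cross-entropy. The short sketch below makes the arithmetic explicit; the function name `bits_saved` is hypothetical.

```python
import math


def bits_saved(perplexity_reduction):
    """Per-word entropy improvement, in bits, for a relative perplexity reduction.

    A reduction r turns perplexity PP into (1 - r) * PP, so the cross-entropy
    log2(PP) drops by -log2(1 - r) bits.
    """
    return -math.log2(1.0 - perplexity_reduction)


half = bits_saved(0.50)    # the 50% reduction quoted above
large = bits_saved(0.38)   # the 38% reduction on the larger corpus
```

A 50% reduction is exactly one bit, whereas the 38% reduction corresponds to roughly 0.69 bits, which shows how quickly the gain in bits shrinks as the relative reduction declines.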

6.  Experimental  Results 

In this section, we first explain the experimental set-up, then show comprehensive
perplexity results in various situations, and end by reporting
the results when we apply the composite language model to the task of re-ranking the
N-best list from a state-of-the-art parsing-based machine translation system.

6.1  Experimental  Set-up 

In  previous  work  (Gildea  and  Hofmann  1999;  Bellegarda  2000;  Chelba  2000;  Chelba 
and  Jelinek  2000;  Charniak  2001;  Roark  2001),  all  complex  language  models  have  been 
trained  on  relatively  small  data  sets.  There  is  the  impression  that  complex  language 
models  only  lead  to  better  results  than  n-grams  on  small  training  corpora.  For  example, 
Jurafsky and Martin (2008, page 482) state, "We said earlier that statistical parsers can
take advantage of longer-distance information than n-grams, which suggests that they
might do a better job at language modeling/word prediction. It turns out that if we
have a very large amount of training data, a 4-gram or 5-gram is nonetheless still the
best way to do language modeling." To verify whether this is true, we have trained our
language models using three different training sets: one has 44 million tokens, another
has 230 million tokens, and the third has 1.3 billion tokens. An independent test set
with 354k tokens is chosen. The independent check data set used to determine the
linear interpolation coefficients has 1.7 million tokens for the 44 million token training
corpus, and 13.7 million tokens for both the 230 million and 1.3 billion token training
corpora. All these data sets are taken from the LDC English Gigaword corpus with
non-verbalized punctuation, and we remove all punctuation. Table 1 provides the detailed
information on how these data sets were chosen from the LDC English Gigaword corpus.
Table 1
The corpora used in our experiments.

1.3 BILLION TOKEN TRAINING CORPUS
  AFP    19940512.0003 ~ 19961015.0568
  APW    19941111.0001 ~ 19960414.0652
  NYT    19940701.0001 ~ 19950131.0483
  NYT    19950401.0001 ~ 20040909.0063
  XIN    19970901.0001 ~ 20041125.0119

230 MILLION TOKEN TRAINING CORPUS
  AFP    19940622.0336 ~ 19961031.0797
  APW    19941111.0001 ~ 19960419.0765
  NYT    19940701.0001 ~ 19941130.0405

44 MILLION TOKEN TRAINING CORPUS
  AFP    19940601.0001 ~ 19950721.0137

13.7 MILLION TOKEN CHECK CORPUS
  NYT    19950201.0001 ~ 19950331.0494

1.7 MILLION TOKEN CHECK CORPUS
  AFP    19940512.0003 ~ 19940531.0197

354K TOKEN TEST CORPUS
  CNA    20041101.0006 ~ 20041217.0009

These are selected from the LDC English Gigaword corpus. AFP = Agence France-Presse;
APW = Associated Press Worldstream; NYT = New York Times; XIN = Xinhua News Agency;
and CNA = Central News Agency of Taiwan denote the sections of the LDC English Gigaword
corpus.

The  vocabulary  sizes  in  all  three  cases  are: 

•  word (also WORD-PREDICTOR operation) vocabulary: 60k, open; all words
outside the vocabulary are mapped to the (unk) token. These 60k words are
chosen from the most frequently occurring words in the 44 million token corpus;

•  POS  tag  (also  TAGGER  operation)  vocabulary:  69,  closed; 

•  non-terminal  tag  vocabulary:  54,  closed; 

•  CONSTRUCTOR  operation  vocabulary:  157,  closed. 

Table 2
Statistics about the number of types of n-grams (n = 3, 4, 5) on the 44 million, 230 million, and
1.3 billion token corpora.

          n = 3          n = 4          n = 5
44 M      14,302,355     23,833,023     29,068,173
230 M     51,115,539     94,617,433     120,978,281
1.3 B     224,767,319    481,645,099    660,599,586

The out-of-vocabulary (OOV) rate on the 44 million, 230 million, and 1.3 billion token
training corpora is 0.6%, 0.9%, and 1.2%, respectively. The OOV rate on the 1.7 million
and 13.7 million token check corpora is 0.6% and 1.3%, respectively. The OOV rate on
the 354k token test corpus is 2.0%. Table 2 lists the statistics about the number of types
of n-grams on these three corpora.
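
The open vocabulary and the OOV rates described above can be illustrated with a minimal sketch. The function names, the toy data, and the vocabulary size of 2 are assumptions made for illustration only; the paper's own pipeline uses a 60k-word vocabulary chosen from the 44 million token corpus.

```python
from collections import Counter

UNK = "(unk)"  # placeholder symbol for out-of-vocabulary words


def build_vocab(train_tokens, size):
    """Return the `size` most frequent word types in the training tokens."""
    counts = Counter(train_tokens)
    return {word for word, _ in counts.most_common(size)}


def map_to_vocab(tokens, vocab):
    """Replace every token outside the vocabulary with the unknown symbol."""
    return [tok if tok in vocab else UNK for tok in tokens]


def oov_rate(tokens, vocab):
    """Fraction of running tokens that fall outside the vocabulary."""
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Toy data (hypothetical): "a" occurs 3 times, "b" twice, "c" once.
train_tokens = "a a a b b c".split()
test_tokens = "a b c d".split()
vocab = build_vocab(train_tokens, size=2)
```

Here the OOV rate is measured over running tokens rather than over word types, which is the usual reading of the percentages quoted in the text.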

Similar to SLM (Chelba 2000; Chelba and Jelinek 2000), after the parse undergoes
headword percolation and binarization, each model component of WORD-PREDICTOR,
TAGGER, and CONSTRUCTOR is initialized from a set of parsed sentences. We use
the openNLP software (footnote 2) to parse a large number of sentences in the
LDC English Gigaword corpus to generate an automatic treebank, which has a slightly
different word tokenization from that of a manual treebank such as the Penn Treebank
used in Chelba and Jelinek (2000) and Chelba (2000). For the 44 and 230 million token
corpora, all sentences are automatically parsed and used to initialize model parameters,
whereas for the 1.3 billion token corpus, we parse the sentences from a portion of the
corpus that contains 230 million tokens, then use them to initialize model parameters.
The parser at openNLP is trained on the Penn Treebank, which has only one million
tokens, and there is a mismatch between the Penn Treebank and the LDC English
Gigaword corpus. Nevertheless, experimental results show that this approach is
effective at providing initial values of model parameters.

6.2  Perplexity  Results 

Table 3 gives the perplexity results (Bahl et al. 1977) of n-grams (n = 3, 4, and 5) using
linear interpolation and Kneser-Ney (1995) smoothing when the training corpus has
44 million, 230 million, and 1.3 billion tokens, respectively. We have implemented a
distributed n-gram with linear interpolation smoothing, but we have not implemented a
distributed n-gram with Kneser-Ney smoothing. Instead, we use the SRI
Language Modeling Toolkit to obtain perplexity results of n-grams with Kneser-Ney
smoothing for the 44 million and 230 million token corpora using a single machine that
has 20G memory at the Ohio Supercomputer Center. We are not able to compute
perplexity results of n-grams with Kneser-Ney smoothing on the 1.3 billion token corpus,
so we leave these results blank in Table 3. From the results in Table 3, we decided to
use a linearly smoothed trigram as the baseline model for the 44 million token corpus,
a linearly smoothed 4-gram as the baseline model for the 230 million token corpus, and
a linearly smoothed 5-gram as the baseline model for the 1.3 billion token corpus.
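
To make the linearly interpolated baseline concrete, the following is a minimal single-machine sketch of an interpolated n-gram and its perplexity. It is not the distributed implementation described in the text: the function names, the fixed interpolation weights, and the fallback to a uniform distribution for unseen contexts are all assumptions of this illustration (the paper tunes the interpolation coefficients on check data).

```python
import math
from collections import Counter


def ngram_counts(tokens, max_order):
    """Count n-grams of order 1..max_order and how often each context is continued."""
    counts, context_counts = Counter(), Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return counts, context_counts


def interpolated_prob(word, history, counts, context_counts, weights, vocab_size):
    """Mix relative frequencies of orders 1..len(weights)-1 with a uniform term.

    weights[0] is the uniform weight; weights[k] weights the order-k estimate.
    The weight of any order whose context was never seen is moved to the
    uniform term, so the result still sums to one over the vocabulary.
    """
    uniform_weight = weights[0]
    prob = 0.0
    for k in range(1, len(weights)):
        need = k - 1  # number of history words used by the order-k estimate
        context = tuple(history[-need:]) if need > 0 else ()
        denom = context_counts[context] if len(history) >= need else 0
        if denom > 0:
            prob += weights[k] * counts[context + (word,)] / denom
        else:
            uniform_weight += weights[k]
    return prob + uniform_weight / vocab_size


def perplexity(test_tokens, counts, context_counts, weights, vocab_size):
    """Exponential of the average negative log-probability per token."""
    order = len(weights) - 1
    log_sum = 0.0
    for i, word in enumerate(test_tokens):
        history = test_tokens[max(0, i - order + 1):i]
        log_sum += math.log(
            interpolated_prob(word, history, counts, context_counts, weights, vocab_size)
        )
    return math.exp(-log_sum / len(test_tokens))


# Toy example (hypothetical data): a bigram model over the vocabulary {a, b}.
train_tokens = "a b a b a b".split()
counts, context_counts = ngram_counts(train_tokens, max_order=2)
weights = [0.1, 0.3, 0.6]  # uniform, unigram, bigram
ppl = perplexity("a b a b".split(), counts, context_counts, weights, vocab_size=2)
```

With only the uniform term (weights `[1.0]`), the perplexity equals the vocabulary size, which is a convenient sanity check for any implementation of this kind.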

2 http://www.codeproject.com/KB/recipes/englishparsing.aspx.

Table 3
Perplexity results of n-grams (n = 3, 4, and 5) using linear interpolation and Kneser-Ney
smoothing when the training set is a 44 million, 230 million, or 1.3 billion token corpus, respectively.

CORPUS    n        LINEAR    KNESER-NEY
44 M      n = 3    262       244
          n = 4    258       235
          n = 5    260       235
230 M     n = 3    217       195
          n = 4    200       183
          n = 5    201       183
1.3 B     n = 3    161       -
          n = 4    141       -
          n = 5    138       -

Table 4
Perplexity (ppl) results and time consumed of the composite n-gram/PLSA language model
trained on three corpora when different numbers of most-likely topics are kept for each
document in PLSA.

CORPUS    n    # OF TOPICS    PPL    TIME (HOURS)    # OF SERVERS    # OF CLIENTS    # OF TYPES OF $w_{-n+1}^{-1} w g$
44M       3    5              196    0.5             40              100             120.1M
          3    10             194    1.0             40              100             218.6M
          3    20             190    2.7             80              100             537.8M
          3    50             189    6.3             80              100             1.123B
          3    100            189    11.2            80              100             1.616B
          3    200            188    19.3            80              100             2.280B
230M      4    5              146    25.6            280             100             0.681B
1.3B      5    2              111    26.5            400             100             1.790B
          5    5              102    75.0            400             100             4.391B

As we mentioned in Section 3.1.1, we can keep only a small set of topics due to
considerations of computational time and resource demand. Table 4 shows the
perplexity results and computation time of composite n-gram/PLSA language models
that are trained on the three corpora when the pre-defined number of total topics is 200,
but different numbers of most-likely topics are kept for each document in PLSA; the
rest are pruned. For the composite 5-gram/PLSA model trained on the 1.3 billion token
corpus, 400 cores have to be used to keep the top five most likely topics. For the
composite trigram/PLSA model trained on the 44M token corpus, the computation
time increases drastically as more topics are kept (from 0.5 hours for 5 topics to
19.3 hours for 200 topics), with less than 5% perplexity improvement (from 196 to 188).
In the following experiments, therefore, we keep the top five topics for each document
from a total of 200 topics; all other 195 topics are pruned.
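
The topic pruning step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name and the toy distribution are hypothetical, and renormalizing the surviving topics is an assumption of this sketch, since the text does not say how the pruned mass is handled.

```python
def prune_topics(topic_probs, keep):
    """Keep the `keep` most likely topics of p(g | d) and renormalize them.

    Returns the pruned distribution and the probability mass that the
    retained topics carried before renormalization.
    """
    top = sorted(topic_probs.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    mass = sum(p for _, p in top)
    return {g: p / mass for g, p in top}, mass


# Hypothetical document with four topics; keep the two most likely ones.
doc_topics = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
pruned, retained_mass = prune_topics(doc_topics, keep=2)
```

The returned mass makes it easy to check how much probability the retained topics cover for a given document.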

Table 5
Perplexity results for various language models on test corpora, where + denotes linear combination, / denotes composite model; n denotes the order
of the n-gram, and m denotes the order of the SLM; the topic nodes are pruned from 200 to 5.

LANGUAGE MODEL                          44M       REDUCTION  230M      REDUCTION  1.3B      REDUCTION
                                        n=3, m=2             n=4, m=3             n=5, m=4
BASELINE n-gram (LINEAR)                262                  200                  138
n-gram (KNESER-NEY)                     244       6.9%       183       8.5%       -         -
m-SLM                                   279       -6.5%      190       5.0%       137       0.0%
PLSA                                    825       -214.9%    812       -306.0%    773       -460.0%
n-gram+m-SLM                            247       5.7%       184       8.0%       129       6.5%
n-gram+PLSA                             235       10.3%      179       10.5%      128       7.2%
n-gram+m-SLM+PLSA                       222       15.3%      175       12.5%      123       10.9%
n-gram/m-SLM                            243       7.3%       171       14.5%      (125)     9.4%
n-gram/PLSA                             196       25.2%      146       27.0%      102       26.1%
m-SLM/PLSA                              198       24.4%      140       30.0%      (103)     25.4%
n-gram/PLSA+m-SLM/PLSA                  183       30.2%      140       30.0%      (93)      32.6%
n-gram/m-SLM+m-SLM/PLSA                 183       30.2%      139       30.5%      (94)      31.9%
n-gram/m-SLM+n-gram/PLSA                184       29.8%      137       31.5%      (91)      34.1%
n-gram/m-SLM+n-gram/PLSA+m-SLM/PLSA     180       31.3%      130       35.0%      -         -
n-gram/m-SLM/PLSA                       176       32.8%      -         -          -         -

All composite language models are first trained by performing the N-best list
approximate EM algorithm until convergence, followed by the EM algorithm for a second
stage of parameter re-estimation for WORD-PREDICTOR and SEMANTIZER until
convergence. We fix the number of topics in the PLSA to be 200 and then prune to 5 in
the experiments, where the unpruned 5 topics in general account for 70% of the probability
in $p(g \mid d)$. Table 5 shows comprehensive perplexity results for a variety of different
models such as composite n-gram/m-SLM, n-gram/PLSA, m-SLM/PLSA, their linear
combinations, and so on, where we use on-line EM with a fixed learning rate to
re-estimate the parameters of the SEMANTIZER of the test document. The m-SLM performs
competitively with its counterpart n-gram (n = m + 1) on large scale corpora. Table 6
lists the statistics about the number of types in the predictor of the m-SLMs on these
three corpora, where for the 230 million token and 1.3 billion token corpora we cut
off the fractional expected counts that are less than a predefined threshold of 0.005,
which significantly reduces the number of the predictor's types by 70%.
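
The count cutoff mentioned above amounts to dropping predictor events whose fractional expected count falls below a threshold. A minimal sketch, assuming the expected counts are held in a dictionary keyed by event (the function name and the toy events are hypothetical, and the paper's distributed storage is not modeled):

```python
def prune_expected_counts(expected_counts, threshold=0.005):
    """Drop events whose fractional expected count is below the threshold.

    Returns the surviving events and the fraction of event types removed.
    """
    kept = {event: c for event, c in expected_counts.items() if c >= threshold}
    if not expected_counts:
        return kept, 0.0
    return kept, 1.0 - len(kept) / len(expected_counts)


# Hypothetical expected counts for four predictor events.
toy_counts = {
    ("the", "cat"): 1.250,
    ("the", "cart"): 0.004,
    ("a", "cat"): 0.001,
    ("a", "cart"): 0.0005,
}
kept, removed_fraction = prune_expected_counts(toy_counts)
```

Reporting the removed fraction alongside the surviving events mirrors the way the text quotes the reduction in the number of predictor types.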

In Table 5, for the composite n-gram/m-SLM model (n = 3, m = 2 and n = 4, m = 3)
trained on 44 million tokens and 230 million tokens, we cut off its fractional expected
counts that are less than a threshold of 0.005; this significantly reduces the number of the
predictor's types by 85%. When we train the composite language model on the 1.3 billion
token corpus, we have to both aggressively prune the parameters of WORD-PREDICTOR and
shrink the order of the n-gram and m-SLM in order to store them in a supercomputer having
1,000 cores. In particular, the composite 5-gram/4-SLM model is too big
to store, so we use its approximation, a linear combination of 5-gram/2-SLM and
2-gram/4-SLM. For the 5-gram/2-SLM or 2-gram/4-SLM, again we cut off its fractional
expected counts that are less than a threshold of 0.005, which significantly reduces the
number of the predictor's types by 85%. For the composite 4-SLM/PLSA model, we cut
off its fractional expected counts that are less than a threshold of 0.002; again this
significantly reduces the number of the predictor's types by 85%. For the composite 4-SLM/PLSA
model or its linear combination with other models, we ignore all the tags and use only the
words in the four headwords. We have checked that the conditional language model
(Equation [30]) sums to 1 for a large number of randomly selected conditional events. The
composite n-gram/m-SLM/PLSA model gives significant perplexity reductions over baseline
n-grams (n = 3, 4, 5) and m-SLMs (m = 2, 3, 4). The majority of the gains come from the
PLSA component, but when adding the SLM component into the n-gram/PLSA, there
is a further 10% relative perplexity reduction.
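
The REDUCTION columns and the "further 10%" figure are relative perplexity reductions. As a small worked check, using perplexities taken from Table 5 for the 44 million token corpus (the function name is an illustrative choice):

```python
def relative_reduction(baseline_ppl, model_ppl):
    """Relative perplexity reduction, in percent, of a model over a baseline."""
    return 100.0 * (baseline_ppl - model_ppl) / baseline_ppl


# Perplexities from Table 5, 44 million token corpus.
kneser_ney_vs_linear = relative_reduction(262, 244)   # Kneser-Ney trigram vs. linear trigram
composite_vs_linear = relative_reduction(262, 176)    # trigram/2-SLM/PLSA vs. linear trigram
slm_added_to_plsa = relative_reduction(196, 176)      # trigram/2-SLM/PLSA vs. trigram/PLSA
```

Rounded to one decimal place these give 6.9%, 32.8%, and 10.2%, matching the table entries and the roughly 10% further gain from adding the SLM component.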

Table 7 shows how large the composite 5-gram/PLSA, 5-gram/2-SLM (or
2-gram/4-SLM), and 4-SLM/PLSA models are when trained on the 1.3 billion token
corpus after aggressive pruning. The minimum total number of servers used to store
the parameters of the predictor for the composite 5-gram/PLSA, 5-gram/2-SLM (or
2-gram/4-SLM), and 4-SLM/PLSA models is 400, 240, and 400, respectively, and the
number of clients used to store the partitioned data of the 1.3 billion token corpus is 100
for each of these three composite language models. There is no way to store the parameters
of the linear combination of the composite 5-gram/PLSA, 5-gram/2-SLM (or 2-gram/4-SLM), and
4-SLM/PLSA models with our currently available supercomputer resources.


Table 6
Statistics about the number of types in the predictor of the m-SLMs (m = 2, 3, 4) on the
44 million, 230 million, and 1.3 billion token corpora. For the 230 million and 1.3 billion token
corpora, fractional expected counts that are less than a threshold are pruned to significantly
reduce the number of m-SLM (m = 3, 4) predictor's types by 70%.

          m = 2          m = 3            m = 4
44 M      189,002,525    269,685,833      318,174,025
230 M     267,507,672    1,154,020,346    1,417,977,184
1.3 B     946,683,807    1,342,323,444    1,849,882,215

Table 7
Counts of the types in the predictor of the 5-gram/PLSA, 5-gram/2-SLM (or 2-gram/4-SLM),
and 4-SLM/PLSA models when trained on the 1.3B corpus. Fractional expected counts that are
less than a threshold are pruned; this significantly reduces the number of predictor's types
by 85%.

COMPOSITE MODEL      TYPES OF                       # OF TYPES    # OF SERVERS    # OF CLIENTS
5-GRAM/PLSA          $w_{-4}^{-1} w g$              4.39 B        400             100
5-GRAM/2-SLM         $w_{-4}^{-1} w h_{-2}^{-1}$    2.01 B        240             100
(or 2-GRAM/4-SLM)    $w_{-1} w h_{-4}^{-1}$
4-SLM/PLSA           $w h_{-4}^{-1} g$              4.88 B        400             100

Appendix A shows an example of sentence probability that is provided by 5-gram,
5-gram/PLSA, and 5-gram/4-SLM+5-gram/PLSA models, respectively; these language
models are trained using the 1.3 billion token corpus. The example demonstrates that
our composite model is able to extract topic information and grammatical structure to
improve word prediction for natural language.

Table 8 shows the perplexity results for composite n-gram/PLSA and
n-gram/m-SLM/PLSA language models when three methods are used to re-estimate the
parameters of the SEMANTIZER of the test document; we use superscripts 1, 2, and 3 to
denote that during testing we used one-step on-line EM, on-line EM with a fixed learning
rate, and batch EM, respectively. The on-line EM with a fixed learning rate gives the
best perplexity results as well as the least computation time. Again, when we train
the composite language model on the 1.3 billion token corpus, we have to shrink the order
of the n-gram and m-SLM in order to store them in a supercomputer having 1,000 cores.
For the composite 4-SLM/PLSA model or its linear combination with other models, we ignore
all the tags and use only the words in the four headwords. For the composite
5-gram/4-SLM model or its linear combination with other models, we in fact use its
approximation, a linear combination of the 5-gram/2-SLM and 2-gram/4-SLM models.
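
To give a feel for on-line EM with a fixed learning rate, here is a minimal sketch of a stepwise update of a test document's topic weights. It is a generic stepwise-EM form written under stated assumptions, not the update used in the paper: the function names are hypothetical, and `word_likelihoods` stands in for the probability that each topic assigns to the observed word in its context.

```python
def online_em_step(theta, word_likelihoods, rate):
    """One stepwise-EM update of the topic weights p(g | d) of a test document.

    theta            -- current topic weights, summing to one
    word_likelihoods -- probability of the observed word under each topic
    rate             -- fixed learning rate in (0, 1]
    """
    joint = [t * l for t, l in zip(theta, word_likelihoods)]
    norm = sum(joint)
    posterior = [j / norm for j in joint]  # E-step: p(g | word, d)
    # M-step: move the weights a fixed fraction of the way toward the posterior.
    return [(1.0 - rate) * t + rate * q for t, q in zip(theta, posterior)]


def adapt_topic_weights(theta, likelihoods_per_word, rate):
    """Apply the update once per observed word, in document order."""
    for word_likelihoods in likelihoods_per_word:
        theta = online_em_step(theta, word_likelihoods, rate)
    return theta


# Two hypothetical topics; the observed word is far more likely under topic 0.
updated = online_em_step([0.5, 0.5], [0.9, 0.1], rate=0.1)
```

Because each update is a convex combination of two distributions, the weights stay normalized without an explicit renormalization step.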

Table 8
Perplexity results for the composite n-gram/PLSA and n-gram/m-SLM/PLSA language models on the test corpus, where + denotes linear
combination, / denotes composite model; n is the order of the n-gram and m is the order of the SLM, and superscripts 1, 2, 3 denote using one-step
on-line EM, on-line EM with fixed learning rate, and batch EM during testing, respectively.

LANGUAGE MODEL                   44M       REDUCTION  230M      REDUCTION  1.3B      REDUCTION
                                 n=3, m=2             n=4, m=3             n=5, m=4
n-gram (LINEAR)                  262                  200                  138
n-gram/PLSA^1                    202       22.9%      150       25.0%      107       22.5%
n-gram/m-SLM+n-gram/PLSA^1       192       26.7%      142       29.0%      (97)      29.1%
n-gram/PLSA^2                    196       25.2%      146       27.0%      102       26.1%
n-gram/m-SLM+n-gram/PLSA^2       184       29.8%      137       31.5%      (91)      34.1%
n-gram/PLSA^3                    201       23.3%      148       26.0%      104       24.6%
n-gram/m-SLM+n-gram/PLSA^3       189       27.9%      140       30.0%      (92)      33.3%

To better explain and analyze our model, we mark the perplexity results for the
44 million token corpus in Table 5 on the vertices in Figure 3.
The baseline trigram result is given by the vertex $p(w \mid w_{-2} w_{-1})$, the 2-SLM result is
given by the vertex $p(w \mid h_{-2} h_{-1})$, the PLSA result is given by the vertex $p(w \mid g)$, the
trigram/2-SLM result is given by the vertex $p(w \mid w_{-2} w_{-1} h_{-2} h_{-1})$, the trigram/PLSA
result is given by the vertex $p(w \mid w_{-2} w_{-1} g)$, and the trigram/2-SLM/PLSA result is given by
the vertex $p(w \mid w_{-2} w_{-1} h_{-2} h_{-1} g)$. The trigram+2-SLM result is given by a linear
combination of vertices $p(w \mid w_{-2} w_{-1})$ and $p(w \mid h_{-2} h_{-1})$; the trigram+PLSA result is given by a
linear combination of vertices $p(w \mid w_{-2} w_{-1})$ and $p(w \mid g)$; and the trigram+2-SLM+PLSA
result is given by a linear combination of vertices $p(w \mid w_{-2} w_{-1})$, $p(w \mid h_{-2} h_{-1})$, and
$p(w \mid g)$. The trigram/PLSA+2-SLM/PLSA result is given by a linear combination of
vertices $p(w \mid w_{-2} w_{-1} g)$ and $p(w \mid h_{-2} h_{-1} g)$, and so on. The
trigram/PLSA+trigram/2-SLM+2-SLM/PLSA result is given by a linear combination of vertices
$p(w \mid w_{-2} w_{-1} g)$, $p(w \mid w_{-2} w_{-1} h_{-2} h_{-1})$, and $p(w \mid h_{-2} h_{-1} g)$. The composite
trigram/2-SLM/PLSA language model is more powerful and expressive than the linear
combination of trigram, 2-SLM, and PLSA for two reasons. First, valuable relative frequency
estimates such as $f(w \mid w_{-2} w_{-1} h_{-2} h_{-1} g)$, $f(w \mid w_{-2} w_{-1} h_{-2} h_{-1})$, and so forth, are
encoded into the composite language model, as seen from Figure 3. As long as there are events
such as $w_{-2} w_{-1} w h_{-2} h_{-1} g$, and so on, that occur explicitly or implicitly in the training
corpus, the composite trigram/2-SLM/PLSA will take them into account to improve
the prediction power for test data, whereas a linear combination of trigram, 2-SLM,
and PLSA just neglects a large amount of this valuable information. The second
reason is that the weights used in a simple linear combination are context-independent,
thus more restricted. Similarly, the composite trigram/2-SLM/PLSA language model
is more powerful and expressive than a linear combination of pairwise composite
language models (e.g., trigram/2-SLM, trigram/PLSA, and 2-SLM/PLSA), since the
composite trigram/2-SLM/PLSA can take advantage of the relative frequency estimates
$f(w \mid w_{-2} w_{-1} h_{-2} h_{-1} g)$, $f(w \mid w_{-2} w_{-1} h_{-1} g)$, and $f(w \mid w_{-1} h_{-2} h_{-1} g)$. The improvement in
this case shrinks, however, because pairwise composite language models use some valuable
lower order relative frequency estimates such as $f(w \mid w_{-2} w_{-1} g)$, and so forth. Stated
another way, each vertex of the lattice in Figure 3 is an expert of WORD-PREDICTOR
that is proficient in making a prediction based on the context represented at the vertex; it
predicts words based on the information provided by a committee consisting of experts
from parent vertices as well as the relative frequency estimate it extracts. These experts
are hierarchically organized, with the WORD-PREDICTOR of the composite
trigram/2-SLM/PLSA (i.e., $p(w \mid w_{-2} w_{-1} h_{-2} h_{-1} g)$) overseeing all available information to make
the most powerful prediction.

Finally, we conducted experiments where we fixed the size of the training data
and increased the complexity of our language models. Because available resources are
limited, preventing us from considering complex language models that are trained on
the 1.3 billion token corpus, we considered complex language models trained on the
44 million token corpus instead. Table 9 shows the perplexity results. We can see that as
we increase the order for n-gram and m-SLM from n = 3 and m = 2 to n = 4 and m = 3,
the composite language models become better and have up to 5% perplexity reductions;
when we increase the order for n-gram and m-SLM to n = 5 and m = 4, however, the
composite language models become worse and slightly overfit the data even if we use
linear interpolation smoothing, and there are no further perplexity reductions.

To  summarize,  as  a  sub-problem  for  MT  and  speech  recognition  under  the  source- 
channel  paradigm  (Jelinek  2009),  language  modeling  is  a  data-rich  and  feature-rich 
density  estimation  problem  with  Kullback-Leibler  divergence  as  a  cost  function,  and 
there  is  always  a  trade-off  between  approximation  error  and  estimation  error  (Barron 
and  Sheu  1991),  reminiscent  of  the  "bias-variance"  trade-off  for  a  regression  problem 
with  a  quadratic  cost  function  (Hastie,  Tibshirani,  and  Friedman  2009).  Figure  5  explains 
the  perplexity  results  in  Tables  3  and  5  from  a  model  selection  point  of  view. 

Table 9
Perplexity results for various language models on test corpora, where + denotes linear combination, / denotes composite model; n denotes the order
of the n-gram, and m denotes the order of the SLM; the topic nodes are pruned from 200 to 5.

LANGUAGE MODEL                          44M       REDUCTION  44M       REDUCTION  44M       REDUCTION
                                        n=3, m=2             n=4, m=3             n=5, m=4
BASELINE n-gram (LINEAR)                262                  258                  260
n-gram (KNESER-NEY)                     244       6.9%       235       8.9%       235       9.6%
m-SLM                                   279       -6.5%      254       1.6%       254       2.3%
n-gram+m-SLM                            247       5.7%       233       9.7%       234       10.0%
n-gram+PLSA                             235       10.3%      230       10.9%      231       11.2%
n-gram+m-SLM+PLSA                       222       15.3%      220       14.7%      221       15.0%
n-gram/m-SLM                            243       7.3%       232       10.1%      235       9.6%
n-gram/PLSA                             196       25.2%      189       26.7%      193       25.8%
m-SLM/PLSA                              198       24.4%      190       26.4%      192       26.2%
n-gram/PLSA+m-SLM/PLSA                  183       30.2%      179       30.6%      178       31.5%
n-gram/m-SLM+m-SLM/PLSA                 183       30.2%      178       31.0%      180       30.8%
n-gram/m-SLM+n-gram/PLSA                184       29.8%      176       31.8%      178       31.5%
n-gram/m-SLM+n-gram/PLSA+m-SLM/PLSA     180       31.3%      173       33.0%      173       33.5%
n-gram/m-SLM/PLSA                       176       32.8%      169       34.5%      171       34.2%

Figure 5
Language modeling is a data-rich and feature-rich density estimation problem. The information
projection from true distribution and empirical distribution to n-grams is unique, and the
information projection from true distribution and empirical distribution to composite language
models might be only locally optimal. There is a trade-off between approximation error and
estimation error for composite language models.

Let $p$ denote the true (but unknown) distribution of natural language; its information
projection to n-grams is the distribution that attains the minimum Kullback-Leibler divergence
from $p$ to n-grams (Amari and Nagaoka 2000; Wang, Greiner, and Wang 2009) and is denoted as
$p_n$, $n = 3, 4, 5, 6$. Let $\tilde{p}$ denote the empirical distribution of natural language; in
particular, $\tilde{p}_M$ denotes the empirical distribution for a million token corpus, $\tilde{p}_B$ denotes the
empirical distribution for a billion token corpus, and $\tilde{p}_T$ denotes the empirical
distribution for a trillion token corpus. The information projection of $\tilde{p}_M$ to trigram is $\tilde{p}_M^3$, to
4-gram is $\tilde{p}_M^4$, and to 5-gram is $\tilde{p}_M^5$. The distance between $p$ and $p_n$ ($n = 3, 4, 5, 6$), $D(p, p_n)$,
is the approximation error when using an n-gram to represent $p$, that is, the best the n-gram
can do when abundant data are available. The distance between $\tilde{p}_M^n$ and $p_n$, $n = 3, 4, 5$,
$D(\tilde{p}_M^n, p_n)$, is the estimation error when only the million token corpus is available. The
Pythagorean theorem states that the distance between $p$ and $\tilde{p}_M^n$, $D(p, \tilde{p}_M^n)$, is the sum
of the approximation error and the estimation error (Barron and Sheu 1991; Amari and
Nagaoka 2000; Wang, Greiner, and Wang 2009). In language modeling research, because
$p$ is unknown, the distance between $p$ and $\tilde{p}_M^n$, $n = 3, 4$, is approximately computed
by the perplexity result using test data. By the Glivenko-Cantelli theorem (Vapnik
1998), we know that the empirical distribution $\tilde{p}$ converges to the true distribution $p$;
similarly, the information projection of the empirical distribution on an n-gram converges
to the information projection on an n-gram of the true distribution (i.e., the estimation error
shrinks to 0). In the same vein, we can define the information projection of $p$ or $\tilde{p}$ to the
composite language models and the corresponding approximation error and estimation
error, and so forth. In this case, the Pythagorean theorem breaks down due to the
non-convexity of the set of composite language models. As noted by Dr. Ciprian Chelba
in our private communication on March 20th, 2010, "When playing with large data,
the model capacity is an important factor to language model performance: The supply
of more data needs to be matched by demand on the model side. A simple way to
achieve this in n-grams is to increase the order n as much as the data will allow. This
of course implies that the computational aspects of storing and serving such models
are solved and that it is not a constraint" (see also Chelba et al. 2010). This is also
true for our composite language models as justified from the results in Tables 5 and
9: The composite n-gram/m-SLM/PLSA language model has rich features, thus has
smaller approximation error than the n-gram, m-SLM, PLSA, or any composite model
of two, or their linear combinations. Table 5 shows that the information projection of
the empirical distribution for the million and billion token corpora, $\tilde{p}_M$ and $\tilde{p}_B$, on the
composite n-gram/m-SLM/PLSA language model is closer to the true distribution $p$.
This is reflected approximately by the perplexity results on test data.
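
The decomposition invoked above for n-grams can be written out explicitly. Here $p$ is the true distribution, $p_n$ is its information projection onto n-grams, and $\tilde{p}_M^n$ is the information projection of the million-token empirical distribution onto n-grams. The argument order inside $D(\cdot \,\|\, \cdot)$ follows the usual statement of the Pythagorean identity for projections onto an exponential family and is an assumption of this sketch, since the text treats the divergence simply as a distance.

```latex
\[
  \underbrace{D\bigl(p \,\|\, \tilde{p}^{\,n}_{M}\bigr)}_{\text{total error}}
  \;=\;
  \underbrace{D\bigl(p \,\|\, p_{n}\bigr)}_{\text{approximation error}}
  \;+\;
  \underbrace{D\bigl(p_{n} \,\|\, \tilde{p}^{\,n}_{M}\bigr)}_{\text{estimation error}},
  \qquad
  p_{n} \;=\; \operatorname*{arg\,min}_{q \,\in\, \text{n-grams}} D\bigl(p \,\|\, q\bigr).
\]
```

Raising the order n shrinks the first term, while the second term grows unless more training data are supplied, which is the trade-off the section describes.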

6.3  Re-ranking  Machine  Translation  Results 

We have applied our composite 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA^1 language
model, trained on a 1.3 billion word corpus, to the task of re-ranking the
N-best list in statistical MT. We used the same two 1,000-best lists that were used
by Zhang and colleagues (Zhang, Hildebrand, and Vogel 2006; Zhang 2008; Zhang
et al. 2011). The first list was generated on 919 sentences of 100 documents from
the MT03 Chinese-English evaluation set, and the second was generated on 191
sentences of 20 documents from the MT04 Chinese-English evaluation set, both by Hiero
(Chiang 2007), a state-of-the-art parsing-based translation model. Its decoder uses a
trigram language model trained with modified Kneser-Ney smoothing (Jurafsky and
Martin 2008) on a 200 million token corpus. Each translation has 11 features, and
the language model is one of them. We substitute our language model and use MERT
(Och 2003) to optimize the Bleu score (Papineni et al. 2002). We conduct two
experiments on these two data sets. In the first experiment, we partition the first data
set, which consists of 100 documents, into ten pieces of 10 documents each;
nine pieces are used as training data to optimize the Bleu score
by MERT, and the remaining single piece is used to re-rank
the 1,000-best list and obtain the Bleu score. The cross-validation process is then
repeated 10 times (the folds), with each of the 10 pieces used exactly once as the
validation data. The 10 results from the folds can then be averaged (or otherwise
combined) to produce a single estimate of the Bleu score. The mean and variance of
the Bleu score are calculated for each LM. We assume that the score follows
Student's t-distribution and compute the 95% confidence interval from the
mean and variance. Table 10 shows the Bleu scores through 10-fold cross-validation.
The composite 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA^1 language model gives a
1.57 percentage point Bleu score improvement over the baseline and a 0.79 percentage
point Bleu score improvement over the 5-gram. We are not able to further improve the
Bleu score when we use either the 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA^2
or the 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA^3. This is because there is not much
diversity in the 1,000-best list: essentially only 20 to 30 distinct sentences are in the
1,000-best list.

Table 10
10-fold cross-validation Bleu score results for the task of re-ranking the 1,000-best list generated
on 919 sentences of 100 documents from the MT03 Chinese-English evaluation set.

SYSTEM MODEL                                  MEAN (%)    95% CI (%)
Baseline                                      31.75       0.22
5-gram                                        32.53       0.24
5-gram/2-SLM+2-gram/4-SLM                     32.87       0.24
5-gram/PLSA^1                                 33.01       0.24
5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA^1       33.32       0.25
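
The confidence interval over the ten folds can be sketched as below. The fold scores are made-up numbers for illustration, not results from the paper, and the function name is hypothetical; the critical value 2.262 is the two-sided 95% point of Student's t-distribution with 9 degrees of freedom, which applies because there are 10 folds.

```python
import math
import statistics

T_CRIT_95_DF9 = 2.262  # two-sided 95% critical value, Student's t, 9 degrees of freedom


def mean_and_ci95(fold_scores, t_crit=T_CRIT_95_DF9):
    """Return the mean score and the half-width of its 95% confidence interval.

    The default critical value assumes exactly 10 folds; pass a different
    `t_crit` for another number of folds.
    """
    n = len(fold_scores)
    mean = statistics.mean(fold_scores)
    std_err = statistics.stdev(fold_scores) / math.sqrt(n)  # sample std / sqrt(n)
    return mean, t_crit * std_err


# Hypothetical Bleu scores (in %) from 10 cross-validation folds.
fold_scores = [32.0, 33.0] * 5
mean_bleu, half_width = mean_and_ci95(fold_scores)
```

The interval is then reported as the mean plus or minus the half-width.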

In the second experiment, we used the first data set as training data to optimize the
Bleu score by MERT; the second data set is then used to re-rank the 1,000-best list and
obtain the Bleu score. To obtain the confidence interval of the Bleu score, we resort to
the bootstrap resampling described by Koehn (2004). We randomly select 10 re-ranked
documents from the 20 re-ranked documents in the second data set with replacement.
We draw the translation results of the 10 documents and compute the Bleu score. We
repeat this procedure 1,000 times. When we compute the 95% confidence interval, we
drop the top 25 and bottom 25 Bleu scores, and only consider the range of the 26th to 975th
Bleu scores. Table 11 shows the Bleu scores. These statistics are computed with different
language models, but on the same chosen test sets. The 5-gram gives a 0.51 percentage
point Bleu score improvement over the baseline. The composite
5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA^1 language model gives a 1.19 percentage point Bleu score
improvement over the baseline and a 0.68 percentage point Bleu score improvement over
the 5-gram.

Table 11
Bleu score results for the task of re-ranking the 1,000-best list generated on 191 sentences of 20
documents from the MT04 Chinese-English evaluation set.

SYSTEM MODEL                                  MEAN (%)    95% CI (%)
Baseline                                      27.59       0.31
5-gram                                        28.10       0.32
5-gram/2-SLM+2-gram/4-SLM                     28.34       0.32
5-gram/PLSA^1                                 28.53       0.31
5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA^1       28.78       0.31
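
The bootstrap procedure described above can be sketched as follows. The function name, the fixed random seed, and the scoring function are assumptions of this illustration: `score_fn` stands in for a corpus-level Bleu computation over the sampled documents, and the demo replaces it with a simple average of made-up per-document values.

```python
import random


def bootstrap_interval(documents, score_fn, sample_size=10, rounds=1000, drop=25, seed=0):
    """Bootstrap a 95% interval by resampling documents with replacement.

    Each round draws `sample_size` documents with replacement and scores
    the sample; the `drop` lowest and `drop` highest scores are discarded
    and the range of the remaining scores is returned.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        sample = [rng.choice(documents) for _ in range(sample_size)]
        scores.append(score_fn(sample))
    scores.sort()
    kept = scores[drop:len(scores) - drop]
    return kept[0], kept[-1]


def average(values):
    """Stand-in scoring function; a real run would compute corpus-level Bleu."""
    return sum(values) / len(values)


# Hypothetical per-document quality values for 20 documents.
doc_values = [float(v) for v in range(20)]
low, high = bootstrap_interval(doc_values, average)
```

With 1,000 rounds and 25 scores dropped at each end, 950 scores remain, so the returned pair spans the 26th to the 975th sorted score.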

Chiang (2007) studied the performance of machine translation on Hiero: the Bleu
score is 33.31% when the n-gram is used to re-rank the N-best list, but it becomes
significantly higher (37.09%) when the n-gram is embedded directly into Hiero's
one-pass decoder. This is because there is not much diversity in the N-best list. We
therefore expect that putting our composite language model into a one-pass decoder
would result in much improved Bleu scores.
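
To make the limitation of N-best re-ranking concrete, the following sketch rescores a fixed list of hypotheses with a weighted sum of feature scores and returns the best one. It is only an illustration: the feature names (`decoder`, `composite_lm`), the weights, and the scores are hypothetical, and in the experiments the weights are tuned by MERT rather than set by hand.

```python
def rerank(nbest, weights):
    """Return the hypothesis with the highest weighted feature sum."""
    def score(hyp):
        return sum(weights[name] * value
                   for name, value in hyp["features"].items())
    return max(nbest, key=score)


# Hypothetical 2-best list with log-domain feature scores.
nbest = [
    {"text": "hypothesis one",
     "features": {"decoder": -10.2, "composite_lm": -35.0}},
    {"text": "hypothesis two",
     "features": {"decoder": -10.5, "composite_lm": -31.0}},
]
weights = {"decoder": 1.0, "composite_lm": 0.2}  # assumed, not tuned

best = rerank(nbest, weights)
print(best["text"])
```

Whatever the language model prefers, `rerank` can only choose among the hypotheses already in the list, which is why a list with little diversity caps the achievable gain.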

Besides reporting the Bleu scores, we look at the "readability" of translations,
similar to the study conducted by Charniak, Knight, and Yamada (2003). Human judges
sort the translations into four groups: good/bad syntax crossed with good/bad meaning
(see Table 12). We find that many more sentences are perfect, many more are
grammatically correct, and many more are semantically correct than with the baseline
or the 5-gram. The syntactic language model of Charniak et al. (2003) improves only the
grammaticality of translations, not the preservation of meaning. The composite
5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA1 language model improves both significantly.
Bear in mind that Charniak et al. (2003) integrated Charniak's language model
with the syntax-based translation model proposed by Yamada and Knight (2001) to
rescore a tree-to-string translation forest, whereas we use only our language model
for N-best list re-ranking. Also, the same study (Charniak et al. 2003) found that the
outputs produced using the n-grams received higher scores from Bleu; ours did not. The
difference between human judgments and Bleu scores indicates that closer agreement
may be possible by incorporating syntactic structure and semantic information into the
Bleu score evaluation. For example, semantically similar words like insure and ensure, as
in the Bleu paper (Papineni et al. 2002), could be treated as interchangeable in the
formula, and a weight could be added to measure the goodness of syntactic structure. Such
a modification should lead to a better metric, and the required information can be
provided by our composite language models.
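
As a rough illustration of the kind of modification suggested above, the sketch below computes a Bleu-style clipped unigram precision against a single reference, optionally mapping words to a canonical form so that semantically equivalent words count as matches. It is a sketch under stated assumptions, not an evaluated metric: the function name, the `canon` mapping, and the example sentences are hypothetical, and a full Bleu score would also use higher-order n-grams and a brevity penalty.

```python
from collections import Counter


def clipped_unigram_precision(candidate, reference, canon=None):
    """Bleu-style modified unigram precision for one reference.

    `canon` optionally maps a word to a canonical form so that words judged
    semantically equivalent are counted as matches. The candidate must
    contain at least one token.
    """
    canon = canon or {}
    cand = Counter(canon.get(w, w) for w in candidate.split())
    ref = Counter(canon.get(w, w) for w in reference.split())
    # Clip each candidate count by its count in the reference.
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / sum(cand.values())


candidate = "we must insure quality"
reference = "we must ensure quality"
synonyms = {"insure": "ensure"}  # hypothetical equivalence class

plain = clipped_unigram_precision(candidate, reference)
relaxed = clipped_unigram_precision(candidate, reference, synonyms)
print(plain, relaxed)  # 0.75 1.0
```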


Table 12
Results of "readability" evaluation on 919 translated sentences of 100 documents. P = perfect;
S = only semantically correct; G = only grammatically correct; W = wrong.

SYSTEM MODEL                               P      S      G      W
Baseline                                   95     398    20     406
5-gram                                     122    406    24     367
5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA1     151    425    33     310

In Appendix B, we give examples of "perfect" sentences, "only semantically
correct" sentences, and "only grammatically correct" sentences.
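
The counts in Table 12 can be turned into shares of the 919 sentences with a few lines of arithmetic. The sketch below assumes, following the category definitions in the caption, that all grammatically correct sentences are P + G and all semantically correct sentences are P + S; the function and variable names are illustrative only.

```python
TOTAL = 919

# Counts copied from Table 12 ("Composite" abbreviates
# 5-gram/2-SLM+2-gram/4-SLM+5-gram/PLSA1).
table12 = {
    "Baseline": {"P": 95, "S": 398, "G": 20, "W": 406},
    "5-gram": {"P": 122, "S": 406, "G": 24, "W": 367},
    "Composite": {"P": 151, "S": 425, "G": 33, "W": 310},
}


def summarize(counts):
    """Return the share of perfect, grammatical, and semantic sentences."""
    total = sum(counts.values())
    return {
        "perfect": counts["P"] / total,
        "grammatical": (counts["P"] + counts["G"]) / total,  # P + G
        "semantic": (counts["P"] + counts["S"]) / total,     # P + S
    }


for system, counts in table12.items():
    assert sum(counts.values()) == TOTAL  # every row covers all sentences
    shares = summarize(counts)
    print(system, {k: f"{100 * v:.1f}%" for k, v in shares.items()})
```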


7.  Conclusion  and  Future  Work 

We have built a powerful large-scale distributed composite language model that
integrates the well-known n-gram, SLM, and PLSA models under the directed MRF paradigm.
The composite language model has been trained by performing a convergent N-best list
approximate EM algorithm and a follow-up EM algorithm to improve word prediction
power on corpora of up to a billion tokens, and is stored on a supercomputer. We have
achieved drastic perplexity reductions and obtained significantly better translation
quality, as measured by the Bleu score and the "readability" of translations, in the task
of re-ranking the N-best list from a state-of-the-art parsing-based MT system. As far as
we know, this is the first work to build a complex large-scale distributed language
model with a principled approach that simultaneously exploits syntactic, semantic,
and lexical regularities and is still more powerful than n-grams trained on a very
large corpus with up to a billion tokens. It is reasonable to conjecture that composite
language models can achieve drastic perplexity reduction and significantly better
translation quality than n-grams when trained on Web-scale corpora that have trillions
of tokens.

As stated in Wang et al. (2010, p. 45), "Since Banko and Brill's pioneering work
almost a decade ago (Banko and Brill 2001), it has been widely observed that the
effectiveness of statistical natural language processing (NLP) techniques is highly
susceptible to the data size used to develop them. As empirical studies have repeatedly
shown that simple algorithms can often outperform their more complicated counterparts in
wide varieties of NLP applications with large data sets, many have come to believe that
it is the size of data, not the sophistication of the algorithms, that ultimately play the
central role in modern NLP (Norvig 2008)." It is true that 'the more the data, the better
the result,' a dictum recently reiterated in a somewhat stronger form in Halevy, Norvig,
and Pereira (2009), but care needs to be taken here. As we explained in the last paragraph
of Section 6.2, after we increase the size of the data, we should also increase the
complexity of the model in order to achieve the best results. For language modeling in
particular, because the expressive power of simple n-grams is rather limited, it is
worthwhile to exploit the latent semantic information and syntactic structure that
constrain the generation of natural language; this usually involves designing
sophisticated algorithms. Of course, this implies that it takes a huge amount of
resources to perform the computation. As cloud computing becomes the dominant platform
for data management and information processing as utility computing, this will become
feasible and affordable.

The development of the large-scale distributed composite language model is in
its infancy; we plan to deepen this research and push it to its limit.
Specifically, we plan to integrate more advanced topic language models such as LDA
(Blei, Ng, and Jordan 2003) and to resort to a hierarchical non-parametric Bayesian model
(Teh 2006; Teh and Jordan 2010) for smoothing the fractional counts that arise from latent
variables, so as to handle the sparse data problem in Kneser-Ney's sense in a principled
manner, thus constructing a family of large-scale distributed composite lexical,
syntactic, and semantic language models. Finally, we will put this family of composite
language models into a phrase-based machine translation decoder (Koehn, Och, and Marcu
2003) that produces a lattice of alternative translations/transcriptions or a syntax-based
decoder (Chiang 2005, 2007) that produces a forest of alternatives (such integration
would, in the exact case, reside in an extremely difficult complexity class, probably
PSPACE-complete) to significantly improve the performance of state-of-the-art
machine translation systems.


Appendix  A:  An  Example  of  Sentence  Probability 

We chose a document from the LDC English Gigaword corpus to show how
sentence probability varies when computed by 5-gram, 5-gram/PLSA, and
5-gram/PLSA+4-SLM/PLSA. The document tag is (XIN/ENG_20041126_0168.story). The
perplexity of this document computed by 5-gram, 5-gram+PLSA, 5-gram+4-SLM+PLSA,
5-gram/PLSA, and 5-gram/PLSA+4-SLM/PLSA, each trained on the 1.3 billion token
corpus, is 97, 93, 83, 71, and 64, respectively. We show the first four sentences
below.


<s> cpc initiates education campaign to strengthen members ' wavering convictions </s>
<s> by zhao lei </s>
<s> beijing nov. 'nmbr xinhua the communist party of china cpc has decided
to launch a mass internal educational campaign from January next year to prevent its members
from wavering in their convictions </s>
<s> the decision aiming to keep the nature of the party
members intact was made at the meeting of the political bureau of the cpc central committee on
this oct. 'nmbr the cpc 's top power organ </s>


We then list, for each word of the fourth sentence, the natural log of its conditional
probability given the document history. In each block below, the first line gives the
words of the sentence, and lines (a) to (e) give the values computed by (a) 5-gram,
(b) 5-gram+PLSA, (c) 5-gram+PLSA+4-SLM, (d) 5-gram/PLSA, and
(e) 5-gram/PLSA+4-SLM/PLSA.


    the        decision   aiming     to         keep       the        nature     of         the
a.  -2.00317   -5.99654   -14.9793   -0.852055  -4.68269   -1.49193   -9.84554   -0.526566  -0.671103
b.  -2.05502   -6.08843   -13.2655   -0.950885  -4.78594   -1.56474   -9.81423   -0.6258    -0.761926
c.  -2.05416   -6.07556   -13.3486   -0.871798  -4.69523   -1.57311   -9.99731   -0.897362  -0.829652
d.  -1.72696   -5.65359   -14.2013   -0.99068   -5.43248   -1.65002   -7.6       -0.612751  -0.731037
e.  -1.80167   -5.73861   -14.5548   -0.893825  -5.05692   -1.60568   -7.92909   -0.751419  -0.755122

    party      members    intact     was        made       at         the        meeting    of
a.  -6.52337   -5.93013   -14.992    -5.5802    -5.91863   -3.47798   -1.0155    -3.77026   -3.11882
b.  -6.48382   -6.00924   -13.8132   -5.57218   -5.98123   -3.56856   -1.1003    -3.87003   -3.14354
c.  -6.48696   -5.81026   -8.11845   -3.04638   -2.21191   -2.80501   -1.12155   -3.85156   -2.3551
d.  -3.46383   -5.03999   -15.242    -5.27819   -4.73655   -3.03394   -0.69443   -3.23709   -3.40986
e.  -3.80075   -5.16911   -8.52597   -3.38567   -2.54778   -2.74127   -0.790644  -3.36195   -2.64652

    the        political  bureau     of         the        cpc        central    committee
a.  -0.619712  -5.91994   -1.36559   -0.17816   -0.217888  -1.55966   -0.282506  -0.110539
b.  -0.710967  -5.96757   -1.47083   -0.278998  -0.313708  -1.66454   -0.387673  -0.215632
c.  -0.636643  -6.0839    -1.43513   -0.6519    -0.634246  -2.10113   -0.504145  -0.216812
d.  -0.475928  -4.13345   -0.527685  -0.226433  -0.204276  -1.55903   -0.379722  -0.147238
e.  -0.475442  -4.43649   -0.702968  -0.427385  -0.388118  -1.79781   -0.42272   -0.136813

    on         this       oct.       'nmbr      the        cpc        's         top        power
a.  -4.33953   -7.02792   -10.7495   -0.0380615 -3.87067   -9.93617   -3.54366   -4.19702   -7.6261
b.  -4.37441   -6.88172   -10.6397   -0.141938  -3.65821   -8.81816   -3.60823   -4.29886   -7.64586
c.  -3.57338   -6.86285   -10.9656   -0.131813  -3.8662    -8.85551   -3.42688   -4.28615   -7.82392
d.  -4.61674   -6.49064   -13.0595   -0.255452  -3.73302   -5.55244   -3.60481   -3.97708   -7.85289
e.  -3.85647   -6.61406   -12.5666   -0.178075  -3.92356   -5.90511   -3.46416   -4.03158   -7.91198

    organ      </s>
a.  -5.97561   -2.62716
b.  -6.08022   -2.67444
c.  -6.01553   -2.65078
d.  -4.84265   -2.76932
e.  -5.05393   -2.70787

The conditional probability of the word party or the words political bureau given the
document history, as computed by 5-gram/PLSA or 5-gram/PLSA+4-SLM/PLSA, is significantly
boosted by the appearance of semantically related words such as cpc and communist
party in the previous sentences. This clearly shows that the composite language models
(5-gram/PLSA and 5-gram/PLSA+4-SLM/PLSA) trigger long-span document-level
discourse topics to influence word prediction. In contrast, there is no such effect when
using linear combination models (i.e., 5-gram+PLSA and 5-gram+4-SLM+PLSA). Similarly,
the conditional probability of the words was made (or the word intact) given the
document history, as computed by 5-gram/PLSA+4-SLM/PLSA, is significantly boosted by
the appearance of the grammatical headword decision (or keep) in the same sentence.
This clearly shows that the composite language model (5-gram/PLSA+4-SLM/PLSA)
exploits sentence-level syntactic structure to influence word prediction. To capture the
same dependencies, an n-gram would have to increase its order to 11 or 8. The linear
combination model 5-gram+4-SLM+PLSA is also quite effective here, although it has a
negative impact on the prediction of function words such as of the after the word
nature or the words political bureau.
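
The per-word values listed above can be combined into a sentence log-probability and a per-word perplexity, and compared word by word. The sketch below copies rows (a) and (e) for the fourth sentence; the helper names are illustrative, and the resulting perplexities are for this single sentence only, so they are not expected to match the document-level perplexities quoted earlier.

```python
import math

# Fourth sentence of the example document, tokenized as in the listing.
WORDS = (
    "the decision aiming to keep the nature of the party members intact was "
    "made at the meeting of the political bureau of the cpc central committee "
    "on this oct. 'nmbr the cpc 's top power organ </s>"
).split()

# Natural-log conditional word probabilities copied from rows (a) and (e).
LOGP = {
    "5-gram": [
        -2.00317, -5.99654, -14.9793, -0.852055, -4.68269,
        -1.49193, -9.84554, -0.526566, -0.671103,
        -6.52337, -5.93013, -14.992, -5.5802, -5.91863,
        -3.47798, -1.0155, -3.77026, -3.11882,
        -0.619712, -5.91994, -1.36559, -0.17816,
        -0.217888, -1.55966, -0.282506, -0.110539,
        -4.33953, -7.02792, -10.7495, -0.0380615, -3.87067,
        -9.93617, -3.54366, -4.19702, -7.6261,
        -5.97561, -2.62716,
    ],
    "5-gram/PLSA+4-SLM/PLSA": [
        -1.80167, -5.73861, -14.5548, -0.893825, -5.05692,
        -1.60568, -7.92909, -0.751419, -0.755122,
        -3.80075, -5.16911, -8.52597, -3.38567, -2.54778,
        -2.74127, -0.790644, -3.36195, -2.64652,
        -0.475442, -4.43649, -0.702968, -0.427385,
        -0.388118, -1.79781, -0.42272, -0.136813,
        -3.85647, -6.61406, -12.5666, -0.178075, -3.92356,
        -5.90511, -3.46416, -4.03158, -7.91198,
        -5.05393, -2.70787,
    ],
}


def sentence_logprob(logps):
    """Sum of per-word natural-log probabilities."""
    return sum(logps)


def perplexity(logps):
    """Per-word perplexity: exp of the negative mean log-probability."""
    return math.exp(-sum(logps) / len(logps))


for name, logps in LOGP.items():
    print(f"{name}: logprob={sentence_logprob(logps):.2f} "
          f"perplexity={perplexity(logps):.1f}")

# Words whose log-probability rises most from row (a) to row (e).
base, comp = LOGP["5-gram"], LOGP["5-gram/PLSA+4-SLM/PLSA"]
gains = sorted(zip(WORDS, (e - a for a, e in zip(base, comp))),
               key=lambda pair: pair[1], reverse=True)
print(gains[:3])
```

Sorting by the difference between the two rows makes it easy to see which words benefit most from the composite model, for example intact, whose log-probability rises from -14.992 to -8.52597.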

Table 13 shows how often the n-gram context is the same as the context of the SLM's
WORD-PREDICTOR in the most likely parse structure of each sentence in the training
corpora. Whenever the two differ, the SLM component is effective in furnishing
sentence-level long-range grammatical information.

This example and Table 13 clearly demonstrate that an n-gram alone is not able to
achieve an effect similar to that of SLM and PLSA, even using Web-scale data, and that
the directed MRF paradigm effectively synergizes n-gram, m-SLM, and PLSA in a
complementary, supplementary, and coherent way to form a powerful language model for
word prediction of natural language.


Table 13
Statistics when n-grams are the same as SLM's WORD-PREDICTOR in the most likely parse
structure of each sentence in training corpora.

Corpus     W_ 2 = h_ j     w l = h\     w_\ = h_\
44 M       57%             46%          38%
230 M      59%             46%          38%
1.3 B      55%             48%          43%


Appendix B: Examples of Translation Results

In the following, we give examples of "perfect" sentences, "only semantically
correct" sentences, and "only grammatically correct" sentences. The number above each
example is the sentence number in the N-best list from Hiero; (a) denotes the reference
sentence, (b) denotes the result provided by the composite language model, and (c)
denotes the result provided by the 5-gram.

A  few  examples  of  "perfect"  sentences  provided  by  the  composite  language  model: 
—512— 

a.  Sri  Lanka's  Prime  Minister  Calls  on  the  People  to  Work  together  for  Permanent  Peace 

b.  Sri  Lanka  prime  minister  called  on  national  common  efforts  to  achieve  lasting  peace 

c.  Sri  Lanka  prime  minister  called  on  the  national  common  achieve  lasting  peace 

—54— 

a.  Wilner  said  the  maximum  penalty  for  securities  fraud  is  10  years  imprisonment.  However, 
the  sentence  is  expected  to  be  "significantly  shorter"  under  the  plea  deal. 

b. wiener, said securities fraud charges could be sentenced to 10 years' imprisonment, according
to pleaded guilty mitigation, the sentence is "shorten".

c.  wiener,  sentenced  to  10  years'  imprisonment  maximum  securities  fraud  charges,  according  to 
pleaded  guilty  mitigation,  the  sentence  is  "shorten". 

—206— 

a. He said at a press conference in Doha, capital of Qatar, that if the United States "attacks Iraq,
it may trigger a global disaster."

b. his press conference in doha, capital of qatar, said "if the united states attacks iraq, it will
trigger a world disaster".

c. his press conference in doha, capital of qatar, said that the united states attacks iraq, "if it will
trigger a world disaster".

—249— 

a. Some Areas in Northwest Australia Face floods

b. floods in some areas in the northwest australia

c.  australia  northwest  part  of  floods 

A  few  examples  of  "only  grammatically  correct"  sentences  provided  by  the  com¬ 
posite  language  model: 

—458— 

a. Sutiyoso said that gardens and flower beds would reduce the impression that the US embassy
is a fort.

b. szudy about woven said that garden landscape could reduce the us embassy to a fortress.

c. szudy over so that garden landscape can reduce the u.s. embassy to a fortress.

—676—

a. He said that during last Christmas and the New Year, mainland tourists' spending accounted
for  30 

b.  during  Christmas  last  year,  he  said,  the  mainland  visitors  spending  will  account  for  a  three  to 
four  percent  of  the  kaneyuki  business  and  become  the  major  consumer  of  the  industry. 

c.  last  year,  he  said,  mainland  visitors  during  the  Christmas  spending  for  the  kaneyuki  3  to 
4  percent  of  the  business,  has  become  the  major  consumption. 

A  few  examples  of  "only  semantically  correct"  sentences  provided  by  the  composite 
language  model: 

—507— 

a.  The  famous  historic  city  of  Cologne  also  narrowly  escaped  the  disaster  in  the  heavy  rains. 

b.  cologne,  a  famous  historical  city  also  escaped  unscathed  in  the  heavy  rain. 

c.  cologne,  a  famous  historical  city  in  heavy  rain,  escaped  unscathed. 

—416— 

a. However, he insisted on the timetable laid down by Bush. That is UN only has "weeks but not
months" to try to disarm Iraq peacefully and it would be military action thereafter.

b. however, he insists the bush timetable, the united nations is "weeks rather than months" to
urge iraq to the peace disarm, then we will take military action.

c. however, he insists that the bush timetable, the only "weeks rather than months" to urge iraq
to the peace disarm, she went on to take military action.

—787— 

a. France circulated its proposals in the form of "a non-paper."

b. france is to distribute their proposals in the form of "non - paper."

c. france is the form of "non - paper" distribute their proposals.

—313— 

a.  In  China,  three-quarters  of  the  1.3  billion  population  were  reported  to  have  celebrated  the  New 
Year  by  watching  television. 

b. 1.3 billion population in china, according to reports, 3 / 4 is to watch tv celebrate lunar
new year.

c. 1.3 billion population in china, according to reports, 3 / 4 is to celebrate televisions.